Abstract


During this past year we have concerned ourselves with the synthesis of tree structures. These
structures offer, in our opinion, the best hope of achieving subpolynomial running times for typical
problems without a degree of interconnection that makes physical implementation difficult.

One would like to be able to synthesize trees using divide & conquer. Divide & conquer is an
appealing technique for tree synthesis because of the isomorphism between the shape of the
desired synthesized system and the recursive descent implicit in divide & conquer. Additionally,
the technique makes good use of theorem proving techniques which are rapidly being developed
for other purposes (see [Smith-83]). Certain problems arise, however, when one tries to use
divide & conquer to synthesize a tree-structured computing system. The exact characteristics
of the problems that can arise fall into three categories, to be described below, but the basic
difficulty is that nodes that are high in the tree are required to either compute or communicate
large amounts of data.

Our primary solution to this problem is to replace the original specification, which in general
declares the existence of an output array that depends on various elements of the input array,
with an equivalent specification which declares the existence of a certain closure, or specialized
functional object, together with a declaration that it be applied. Constraints are imposed on
the closure so that application of this closure will have the desired effect. We show that closures
can be computed and applied rapidly, in time $O(\log^i n)$ for small, constant $i$ on problems of size
$n$, even in many cases where the normal results of divide & conquer would be a computation
that could only be performed in time $O(n^j)$ for strictly positive constant $j$.

We have also found an interesting synthesis path for several binary addition circuits that uses
this technique and another technique called quantifier levelling.


Trees  of  Processors 


In this report we examine one class of methods for producing highly concurrent architectures.
These architectures are vital to meet the needs for sufficiently fast computation to make certain
problems practical. Automatic systems for the synthesis of these architectures are therefore
important because hand crafting is a difficult, expensive and error-prone process. In this report
we explore the synthesis of tree-structured architectures. Other architectures have been explored
in prior reports ([King-83], [KingBrown-83]).

Trees of processors can be used to efficiently implement many specifications because the tree is
that topology with fixed arity and lowest connectivity that allows a distinguished node to have
contact with all other nodes in $O(\log n)$ steps, which is clearly the best possible. TransConS
([King-83]) therefore has facilities for specifying, synthesizing and manipulating trees.

The  description  of  a  tree  is  specified  in  TREE  declarations,  described  below.  Before  describing 
the  syntax  of  a  TREE  declaration,  we  will  describe  some  of  the  semantics  we  intend  for  it. 

The  trees  we  intend  to  address  are  used  to  shorten  the  longest  path  lengths  within  the  collection 
of  processors,  and  to  balance  the  workload  of  a  computation.  There  are  problems  amenable  to 
a  tree  solution,  portions  of  which  are  in  some  sense  more  important  than  others  (for  example 
Optimal  Binary  Search  Trees),  but  in  these  problems  there  must  be  a  specification  of  relative 
importance  that  has  a  size  comparable  to  the  size  of  a  good  specification  of  the  solution. 
We  will  therefore  model  solutions  to  problems  of  this  sort  by  building  separate  trees  and 
AGGREGATEing them. Each tree described in a single locution will be balanced.

Several principles govern the design of the tree system of TransConS.

> All trees are as balanced as possible. (We use binary trees; extensions to trees of higher
arity introduce no new principles.) No flexibility in terms of shape is assumed, nor is any way
provided for expressing shapes.

> A tree specification must include a size, which can be any integer greater than one.

> The shapes of two trees of the same size are identical. That is, there is an isomorphism
≅ between two trees of the same size that maps parents, left children and right children
respectively into parents, left children and right children. There are "compile-time" constructs
in the TransConS language that allow for the specification of connections to the node that
is ≅ to a given node, or AGGREGATION between corresponding nodes of different trees.
One way to achieve this identity of shape is to have a left-biased tree that is as balanced as
possible. In other words, path lengths from root to leaves differ by at most one and if one
such path is longer than a second the first path must be to the left of the second.

> The nodes of a tree are divided into three groups. They are the root, the internal nodes, and
the leaves. The leaves are further distinguished by indices. References to any of these classes
of tree nodes, either to attach procedures, to specify communication such as HEARS, or to
AGGREGATE, can be made. Tags are provided for a node to refer to a node of another tree
that is ≅ to it if the two trees are the same size. This allows nodes in ≅-equivalence classes
to be AGGREGATED or to HEAR each other. For this to work values have to be declared
properly. Note that a leaf has to offer instances of values that are HEARd upward, and the
root has to offer values that are HEARd downward.
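The left-biased, maximally balanced shape described in the principles above can be sketched in Python. This is only an illustration, not part of TransConS: it assumes that the "size" of a tree is its number of leaves, and the names `Node`, `left_size`, `build` and `leaf_depths` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    index: Optional[int] = None  # set only on leaves


def left_size(n: int) -> int:
    """Leaves given to the left child of a left-biased balanced tree with n >= 2 leaves."""
    height = (n - 1).bit_length()   # ceil(log2(n))
    half = 1 << (height - 1)        # leaves one child can hold at full depth
    return min(half, n - half // 2)


def build(n: int, first_index: int = 1) -> Node:
    """Build the unique shape for size n (assumption: size = number of leaves)."""
    if n == 1:
        return Node(index=first_index)
    k = left_size(n)
    return Node(left=build(k, first_index), right=build(n - k, first_index + k))


def leaf_depths(node: Node, depth: int = 0) -> List[int]:
    """Root-to-leaf path lengths, listed from the leftmost leaf to the rightmost."""
    if node.left is None:
        return [depth]
    return leaf_depths(node.left, depth + 1) + leaf_depths(node.right, depth + 1)
```

Because `build` depends only on `n`, two trees of the same size receive the same shape, which is what makes the node-to-node correspondence between trees well defined.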

To support these stipulations we have the TREE data type. A tree is declared and its components
laid out using the type facility of CHI. As an example, we will describe below a situation where
there are two trees, T and U. Each is of size n. Each internal node of T passes a value to its
children after having multiplied it by a value from the corresponding internal node of U. Each
internal node of U adds values from its two children. The procedures at the leaves of T and U,
respectively, are described by functions H and G, not interpreted here.

T Istype TREE (i), i ∈ [1 .. n-1] SIZE n
    root   HAS v   TALKS leftson (SENDS v)
                   TALKS rightson (SENDS v)
                   HEARS source (USES outside-value)
                   HEARS U.root (USES u-value)
    inter  HAS v   TALKS leftson (SENDS v)
                   TALKS rightson (SENDS v)
                   HEARS parent (USES v.parent)
                   HEARS U.inter (USES u-value)
    leaf   HAS v   HEARS parent (USES v.parent)

U Istype TREE (i), i ∈ [1 .. n-1] SIZE n
    root   HAS u   TALKS T.root (SENDS u)
                   HEARS leftson (USES v.left)
                   HEARS rightson (USES v.right)
    inter  HAS u   TALKS T.inter (SENDS u as u-value)
           HAS v   TALKS parent (SENDS v)
                   HEARS leftson (USES v.left)
                   HEARS rightson (USES v.right)
    leaf   HAS v   TALKS parent (SENDS v)
                   HEARS source_i (USES A_i)

(in T.root)
    v ← outside-value × u-value
(in T.inter)
    v ← v × u-value
(in T.leaf_i)
    l_i ← H(v)
(in U.root)
    v ← v.left + v.right
(in U.inter)
    v ← v.left + v.right
    u ← v
(in U.leaf_i)
Chapter 1


Closure-Assisted Divide & Conquer
or,  LAMBDA:  The  Ultimate  Transceiver* 


§1.1  Motivation 

Suppose information must flow from processor B to processor A, but there is a conceptual
advantage to viewing the problem as if information were flowing the other way. We have
two motivating situations where this is the case. One is the handshake problem, where an
intermediate processor in a chain of pipelining processors must be able to declare its readiness
to handle another datum after it has processed a first. The second is problems requiring
tree-structured collections of interconnected processors. We would like to use divide & conquer to
synthesize these trees, but that technique is difficult to apply if data conceptually flow both up
and down the tree. It becomes easier if the flow is conceptually one way. We claim that divide
& conquer is a powerful synthesis technique that can produce a large class of tree-structured
architectures if problems can be rephrased in terms of one-way data flow.

We  want  to  bring  about  a  structure  in  which  information  flowing  in  one  direction  tells  the 
receiving  processor  what  to  do  with  other  information  computed  in  the  receiving  processor.  We 
want a new type of datum, the "self-addressed stamped envelope". Processor A sends processor
B  an  instance  of  this  type  of  datum,  and  B  can  later  use  it  to  cause  the  data  to  be  sent  back 
to  A  and  to  be  used  properly. 

We use closures to do this. We explore the weaknesses of divide & conquer without closures
below, and then we explore some of the implications of closures.




§1.2  Divide  &  Conquer  Paradigm  and  Tree  Synthesis 

Divide-and-conquer (D&C) is a widely used technique for the synthesis of single-processor
programs, and one feels that it should be a good technique for the synthesis of tree-shaped parallel
structures. Trouble often arises, however, when we try to use D&C for this purpose.

* With apologies to Guy Steele [Steele-77]




Consider what the D&C technique actually is. "To solve a 'large' problem instance, break it into
pieces, solve the problem for each of the pieces, and combine the solutions". This is a technique
for generating $O(n)$ and $O(n \log n)$ time, single processor solutions to a wide variety of problems.
See, for example, [Smith-83] and [Knuth-vol1].

Intuition  would  lead  us  to  believe  that  D&C  is  useful  for  synthesising  tree-structured  parallel 
structures,  because  the  structure  of  a  solution  closely  matches  the  structure  of  the  set  of 
processors. Three sorts of problems arise, however:

> rootlock: When we try to combine two subproblems' solutions, the amount of information
traveling either from one subproblem to the other or from the subproblems to the combination
operator, or the amount of work necessary to combine, may be asymptotically large in the
problem size. A naively synthesized parallel structure would have to perform all of this work
in one processor, namely a "root" processor that has responsibility for combining two
half-solutions into a solution to the whole problem.

> sequentiality: In a variant of D&C, one solves one of the subproblems first, and uses some
function of the solution as a parameter to the process that takes place on the second side.
It is clear that in this case no problem element can enter the computation until all previous
elements have been used. There is no concurrency.

> bidirectionality: Information might have to flow both up and down the tree to make a solution.
This situation can make formal description of a combination operator for D&C hard. It might
appear that this condition is intrinsic to divide & conquer, but that is not the case. The data
could already be distributed among an array of processors (or available to be so distributed)
and the division step can manipulate indices only.

It is possible to have bidirectionality without sequentiality, but not vice versa. Rootlock is
independent of the other two situations.

These  three  properties  of  D&C  solutions  to  specifications  are  impediments  to  easy  synthesis  of 
tree-structured parallel structures for these specifications.

A specification, three of whose natural D&C solutions have one of these features each, is Prefix
Summation. In this specification we have a vector $A$ of dimension $n$, and we want to create a
vector $A'$ such that $\forall\, 1 \le i \le n\ [\,a'_i = \sum_{1 \le j \le i} a_j\,]$. In what follows I will use the
words "left" and "right" as if the array were arranged in a row with $a_1$ leftmost and $a_n$ rightmost.

One solution is "to perform prefix summation on a non-trivial vector, divide it into two halves,
perform prefix summation on each half, and add the rightmost element of the left result to each
element of the right result". This solution has two-way data flow.

A second solution is to first define "augmented prefix summation with augend $x$" as
$\forall\, 1 \le i \le n\ [\,a'_i = x + \sum_{1 \le j \le i} a_j\,]$. To perform augmented prefix summation
with augend $x$ on a non-trivial vector $a_{l:u}$, divide it into two halves $a_{l:m}$ and $a_{m+1:u}$, perform
augmented prefix summation with $x$ on the left half, and perform augmented prefix summation
with $x = a'_m$ on the right half. This is intrinsically sequential.

A third solution is similar to the first, except that the result vector is carried up the tree as the
value of the D&C step rather than having as the goal to develop the new values at the leaves.
This has rootlock, i.e., it is intrinsically an $O(n)$ solution, as it requires funnelling the entire
result vector through the root.
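As a point of reference, the first two solutions can be written as ordinary single-processor Python. This is a sketch only, with hypothetical function names. On one processor the first and third solutions coincide, since the result vector is simply returned; they differ only in where the results live once the recursion is mapped onto a tree of processors.

```python
from typing import List


def prefix_two_way(a: List[int]) -> List[int]:
    """First solution: solve both halves, then add the left total to the right half."""
    if len(a) <= 1:
        return list(a)
    m = (len(a) + 1) // 2
    left = prefix_two_way(a[:m])
    right = prefix_two_way(a[m:])
    # In a tree of processors this addition is the downward flow of data.
    return left + [left[-1] + r for r in right]


def prefix_augmented(x: int, a: List[int]) -> List[int]:
    """Second solution: augmented prefix summation with augend x."""
    if not a:
        return []
    if len(a) == 1:
        return [x + a[0]]
    m = (len(a) + 1) // 2
    left = prefix_augmented(x, a[:m])
    # The right half cannot start until the left half has produced its last element.
    right = prefix_augmented(left[-1], a[m:])
    return left + right
```

In `prefix_augmented` the dependence of the second recursive call on `left[-1]` is the sequentiality described above; in `prefix_two_way` the two recursive calls are independent, but the combining step touches every element of the right half.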

Our solution to this problem is to use an upward (toward the root) flow of closures to represent
the downward flow of data.
The solution is based on the idea of passing a form of data called a closure up the tree. A closure
is a procedure or function definition together with an environment, i.e., a set of name/value pairs.
When a closure is invoked, the procedure or function is invoked in the included environment as
augmented by parameter binding. When processor A passes processor B a closure, A is said to
be the closure's host and B the recipient.

The actual closure is not sent. Instead, a token is sent that the recipient can use to invoke
the closure's program by sending back (to the host) the token together with values for the
arguments. By convention this causes the host to invoke the procedure, using stored bindings
and possibly some new ones from the sent arguments as an environment. The motivation for
this is that while conceptually data (i.e., the closures) are flowing in only one direction, in fact
data are flowing in the other direction as well (in the form of arguments and invocation requests).

In this manner we can reformulate the problem from one of creating some new array that is
a function of an existing array to that of creating a closure that, when invoked, will perform
a given action on the leaves of a tree. This action is the creation of an element of the new
array in each leaf. The original specification is transformed into a specification that declares
the existence of a closure that, when invoked, will satisfy the original specification, followed
by a specification that the new closure be invoked. The three barriers to simple tree solutions
described above do not arise. We consider a synthesis of parallel prefix summation in the next
Chapter.
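The reformulation can be illustrated for prefix summation with a short Python sketch. It is not the synthesized structure itself: the recursion stands in for the tree of processors, Python function values stand in for closures, and the names `synthesize` and `prefix_sums` are hypothetical. Each node passes upward its partial sum and a closure; the closure of an internal node applies the left child's closure to its argument and the right child's closure to the argument plus the left partial sum.

```python
from typing import Callable, List, Tuple

Closure = Callable[[int], None]


def synthesize(a: List[int], out: List[int], lo: int, hi: int) -> Tuple[int, Closure]:
    """Return (sum of a[lo:hi], closure writing the prefix sums of that range into out)."""
    if hi - lo == 1:
        def leaf(x: int) -> None:
            out[lo] = x + a[lo]          # side effect local to the leaf
        return a[lo], leaf

    mid = (lo + hi + 1) // 2
    v_left, c_left = synthesize(a, out, lo, mid)
    v_right, c_right = synthesize(a, out, mid, hi)

    def inter(x: int) -> None:
        c_left(x)                        # these two applications are independent
        c_right(v_left + x)              # and could proceed concurrently

    return v_left + v_right, inter


def prefix_sums(a: List[int]) -> List[int]:
    out = [0] * len(a)
    if a:
        _, closure = synthesize(a, out, 0, len(a))
        closure(0)                       # the root applies the closure it received
    return out
```

Only sums and closures travel toward the root while the closure is being built; the downward movement of data happens entirely through closure application.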

We have exchanged the difficulty of reasoning about two-way data flow with the need to reason
about closures. We feel that this is a good bargain because reasoning about closures only requires
the addition of new axioms to a theorem prover's data base, while two-way data flow requires
changes in the way we look at D&C. Below we show that this change of view costs little speed,
and in the next Chapter we show that no expressive power is lost.

We conjecture that this technique can bring most $O(\log n)$ and $O(\log^i n)$ tree parallel structures
within the reach of a D&C-based synthesis method. We support this conjecture by several
syntheses in the next Chapter. Since a tree-structured processor is inexpensive to manufacture
compared to more highly interconnected machines and seems to be reasonably powerful, we feel
that automatic tools that make use of this power easier would be an important contribution to
the technology of synthesis of parallel structures.

We  first  prove  that  the  computation  of  the  closure  in  the  root  node  is  fast: 

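Theorem 1.1. Suppose a problem fits a divide and conquer scheme without sequentiality or
bidirectionality. That is, that the computation of the result in question for the substring of the
problem ranging from $l$ to $u$ is

$$V_l^u = \begin{cases} \text{the leaf value } V_l^l & \text{if } l = u \\ G\left(V_l^{\lfloor (l+u)/2 \rfloor},\ V_{\lfloor (l+u)/2 \rfloor + 1}^{u}\right) & \text{otherwise} \end{cases}$$

and $T(G)$ (the time to compute $G$) is $\le O(F(u - l + 1))$, where $F$ is a nondecreasing function.
Then $T(V_1^n) = O(F(n) \log n)$.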

Proof: Note that the form of the definition of $V_l^u$ precludes sequentiality and bidirectionality.
We are using value semantics for the call to $G$.

$T(V_l^l) = T(V_1^1)$, so $T(V_l^l)$ is bounded. Say $T(G) \le c_0 F(u - l + 1)$. We offer an inductive proof that
$T(V_l^u) \le c_0 F(u - l + 1) \lg(u - l + 1) + T(V_1^1)$, where $c_0$ is the constant of $T(V_1^n) = O(F(n) \log n)$.

The base case is immediate.
If $l \ne u$ then

$T(V_l^u) = \max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lfloor (l+u)/2 \rfloor + 1}^{u})\right) + T(G)$   (definition, nonsequentiality)

$\le c_0 F((u - l + 1)/2) \lg((u - l + 1)/2) + T(V_1^1) + T(G)$   (by induction)

$\le c_0 F(u - l + 1) \lg(u - l + 1) + T(V_1^1)$   (monotonic $F$)

This is $O(F(u - l + 1) \log(u - l + 1))$, which is $O(F(n) \log n)$ at $T(V_1^n)$. ∎
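The bound can be checked numerically for problem sizes that are powers of two, where the halving in the induction step is exact. The sketch below is only an illustration under that assumption; `parallel_time`, `bound`, `c0` and `t_leaf` are hypothetical names, with `t_leaf` standing for the time to compute a leaf value.

```python
import math
from typing import Callable


def parallel_time(n: int, F: Callable[[int], float], c0: float, t_leaf: float) -> float:
    """Time to compute the result for n elements (n a power of two).

    Both halves are computed concurrently and take the same time, so the
    maximum over the two halves is just the time for one half.
    """
    if n == 1:
        return t_leaf
    return parallel_time(n // 2, F, c0, t_leaf) + c0 * F(n)


def bound(n: int, F: Callable[[int], float], c0: float, t_leaf: float) -> float:
    """Right-hand side of the inductive claim: c0 * F(n) * lg(n) + t_leaf."""
    return c0 * F(n) * math.log2(n) + t_leaf
```

With constant `F` the two quantities coincide, which shows that the logarithmic factor cannot be dropped in general; with a growing `F` such as the square root, the bound is loose.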

This theorem only holds if sequentiality and bidirectionality are not present. Sequentiality
cannot be present because $T(V_l^u) = \max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lfloor (l+u)/2 \rfloor + 1}^{u})\right) + T(G)$ only holds if the
computation of the $V$'s can proceed in parallel, and bidirectionality must not be present as there
is nothing in the statement of the theorem to allow for this. It holds even if rootlock is present,
but in such a case the theorem produces a weak result, since $F(n)$ would be large.

We then prove that the application of the closure that is computed in the root is also fast:

Theorem 1.2. Suppose a closure is computed in the root of a balanced binary tree. That
closure can contain closures whose hosts are its children. Those closures, in turn, can
contain closures whose hosts are their children, etc. Suppose all closures computed within the
tree are of the form $C_l^u = \lambda^{Z}[G(Z, V_l, V_r, W_l^u)]$ where $V_l$ includes $C_l$ and values
that are available in constant time, $W_l^u$ includes locally available values, and $G$ is of the
form $G(Z, V_l, V_r, W_l^u) = \{C_l(G_l(V_l, Z)) \parallel C_r(G_r(V_r, Z)) \parallel G_0(Z, V_l, V_r, W_l^u)\}$ (here $C_l$ is the closure
contained in $V_l$ and $G_0$ can affect $W_l^u$.) If $\max(T(G_l), T(G_0), T(G_r)) = O(F(u - l + 1))$, then
$T(C_1^n) = O(F(n) \log n)$.

Proof: virtually identical to the previous proof. ∎

In summary, the technique of computing closures from component closures is a technique which,
together with divide-and-conquer, provides the ability to synthesize a wide variety of tree
structures with few of the technical problems that other synthesis methods might encounter
concerning reasoning about path lengths or the cardinality of sets of nodes. It allows us to do
this and to still produce the $O(\log n)$ (or $O(\log^i n)$ for small $i$) parallel structures we expect from
trees.


§1.3 Description of Closures

A closure consists of a procedure, and bindings for some of the procedure's free variables. The
procedure, in turn, consists of a piece of program and a binding list. The concept was first
described in Church's λ-calculus [Church-31]. Closures are valued for their expressive power
even on single-processor algorithms. They are elements primarily of dialects of LISP. See, for
example, [Steele-77], [Moon-82], [Interlisp-83]. A similar concept, actors, is also found in other
languages. (See, for example, PLASMA in [SmiHew-75].) Actors are also described, as here, as
a method of expressing interprocessor communications concepts. I here explore a case in which
it makes the task of writing programs an easier one for computers.

It is common to use the notation $\lambda x_1, x_2, \ldots, x_n [F(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m)]$ to denote
abstraction of a function of $n$ parameters from a function of $n + m$ parameters. The $y_i$'s are
free variables, meaning that their values are determined by some of the context in which the
function is evaluated.

We will use "$\lambda^{x_1,\ldots}_{y_1,\ldots}[F(x_1, \ldots, y_1, \ldots, z_1, \ldots)]$" to denote a piece of program text that makes a
closure that can be applied to as many parameters as there are $x$'s. In other contexts we will
use that to name the closure itself. When it is applied the $x$-values from the application, the
$y$-values available at closure creation time and the $z$-values at application time will be used. The
$y$'s are called the closed variables. We will use "$\lambda^{x_1,\ldots}[F(x_1, \ldots, v_1, \ldots, z_1, \ldots)]$" created by
the above fragment to denote the closure in which $y_1 = v_1$, ...
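The three roles played by the $x$, $y$ and $z$ variables can be seen in a small Python sketch. The names `make_closure` and `context` and the particular function $F$ are hypothetical, chosen only to make the distinction visible; Python's own closures fix `y1` at creation time, while the dictionary `context` models a value that is read only when the closure is applied.

```python
from typing import Callable, Dict

# z-values: not closed over, read from the context at application time.
context: Dict[str, int] = {"z1": 100}


def make_closure(y1: int) -> Callable[[int], int]:
    """Plays the role of the program text: it fixes the closed variable y1 = v1."""
    def closure(x1: int) -> int:
        # F(x1, y1, z1), an arbitrary function chosen for illustration.
        return x1 + 10 * y1 + context["z1"]
    return closure
```

For instance, `c = make_closure(2)` gives `c(1) == 121`; after `context["z1"] = 200` the same closure gives `c(1) == 221`, while the closed value of `y1` is unaffected by anything that happens after creation.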


§1.4 Transmitting a Closure

To transmit a closure from one processor to another, it is not necessary to transmit the entire
program and all of the environment values, provided that the processor sending the closure
stands willing and able to perform the work, and that the side effects are within reach of the
sending processor. In the cases we explore there are side effects that are only within reach of the
sending processor. Since the motivation for closures in the first place was the desire for a datum
that, when exercised, would cause certain desirable behavior in the host, this will normally be
the case.

All that is necessary is that the transmitting processor send a token of some sort. The receiving
processor can save the token and later use the closure by sending back the arguments, the token,
and control information.

When this is done, the processor sending the closure (and willing to do the work) is called the
closure's host, and the receiving processor (which has a license to use the closure) is called the
recipient.

We say that a closure is live if there is a possibility that it will be invoked at a given time.
A closure becomes live when it is sent and remains live until the recipient reaches a point in
its procedure past which it cannot invoke the closure. We will have more to say about issues
concerning the liveness of closures during the remainder of this Section.

Closures can be efficiently implemented in a reasonable machine model. Internally, a closure can
be implemented as a block of memory locations containing a "pointer" to the program fragment
and a list of all closed variables and the corresponding values. A pointer to the block could be
used as the token. When a closure is applied the recipient can send the host a copy of the token,
together with whatever other information is needed (primarily the argument(s)). The host can
use the received token and can invoke the proper code with the proper environment and with
the arguments bound to the parameters by using the information contained in the closure and
message. A piece of program text (in the host processor) that creates a closure will be called
a closure generating form or CGF, and a piece of text (in the receiving processor) that invokes
one will be called a closure invoking form or CIF. The class of closures generated by one CGF
is a family. An instance of the family of closures generated by a specific CGF named C will be
called a C instance or an instance from C. Members of a family differ only in the environments,
since the code will be the same.
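The token mechanism just described can be sketched in Python, with ordinary method calls standing in for messages on a wire. The classes `Host` and `Recipient`, their methods, and the example procedure `store_scaled` are hypothetical names introduced for illustration; the block table plays the part of the memory blocks, and an integer key plays the part of the pointer used as a token.

```python
from typing import Any, Callable, Dict, Tuple


class Host:
    """Owns the closure blocks; only tokens ever leave this processor."""

    def __init__(self) -> None:
        self._blocks: Dict[int, Tuple[Callable[..., Any], Dict[str, Any]]] = {}
        self._next_token = 0
        self.state: Dict[str, Any] = {}   # side effects stay within the host

    def generate(self, code: Callable[..., Any], env: Dict[str, Any]) -> int:
        """A CGF: store the code pointer and closed values, return a token."""
        token = self._next_token
        self._next_token += 1
        self._blocks[token] = (code, dict(env))
        return token

    def receive(self, token: int, *args: Any) -> Any:
        """Handle an invocation message: the token plus the arguments."""
        code, env = self._blocks[token]
        return code(self, *args, **env)


class Recipient:
    """Holds a token, i.e. a license to use a closure hosted elsewhere."""

    def __init__(self, host: Host, token: int) -> None:
        self.host, self.token = host, token

    def invoke(self, *args: Any) -> Any:
        """A CIF: send the token and arguments back to the host."""
        return self.host.receive(self.token, *args)


def store_scaled(host: Host, x: int, scale: int) -> int:
    """Example closure code: the closed variable is scale, the parameter is x."""
    host.state["result"] = scale * x
    return host.state["result"]
```

Two calls to `generate` with the same `code` and different `env` give two members of one family: they share the program fragment and differ only in the stored environment.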

The required data transmission can be reduced in cases where it is possible to infer various
things about the use of a closure. For example, if it is known that only one instance from a
given CGF is live at a time, the host need not send the token, but only the name of the CGF.
That name would not vary and can be "assembled into" the CIF. This can be true even if there
can be several CGF instances for a given CGF, provided that the host knows what order the
recipient will use the closures it receives. If there is only one CGF in a processor, and only one
instance of the closures that it generates can be live at one time, the token can vanish; the fact
that the "receiving" processor wants to apply any closure is information enough! The closure
has been completely swallowed up; information only travels from the recipient to the host, even
though the synthesis was performed as if data flowed only in the other direction.


A further simplification, of interest for the problem of synthesizing parallel structures that will
later be reduced to VLSI, is available. Suppose the following conditions are met: Applying a
closure does not include changing state in the host processor. (In this case, for the application
to be useful it must cause other applications in the host.) Assume also that there is only one
live closure in a given family at any time. Assume further that the values used in that closure
to call other closures hosted elsewhere can be computed, using only values available to the host,
by means of combinatorial logic (the code fragment is loop-free and consists only of operators
chosen from a library of integrateable operators).


In this case it is possible to perform the closure using only "combinatorial logic" in the host
processor. Specifically, no register need be provided to hold the closure's parameter in the host
processor. Instead, logic must be provided to map a signal representing an application of the
closure to signal(s) representing application(s) of the subsequently called closure(s). Registers
are provided to hold all of the values of the closure. An example taken from the Parallel Prefix
structure (whose derivation sketch is in the next Chapter) will make this clear.


We have the code fragment to synthesize a closure, namely $\lambda^{x}[C_l(x) \parallel C_r(v_l + x)]$.* Here
it can be established that there is only one outstanding instance of the closure at any time,
that the closure does nothing more than apply other closures to a function of its argument, and
that the computation performed on the argument is "easy". We can therefore use the circuit of
Figure 1.

* For clarity, the exposition assumes that the prefix operation is addition, and that we consider addition
to be integrateable.
Figure 1. Simplified Parallel Prefix Internal Node


§1.5 Formal Arguments for the Admissibility of Closures

In this Section we formalize the notions we use to argue that restricting communication to the
upward direction in trees is a harmless restriction, not preventing the synthesis of tree parallel
structures to meet any specification that could have been met absent this restriction, provided
only that we also allow upward communication of closures and that we not consider the application
of a closure that was communicated upward to be a downward communication.

First  we  need  a  formal  definition  of  a  tree  parallel  structure: 

Definition 1.3. A tree parallel structure (tree structure) is a collection of processors together
with programs that meet all of the following conditions:

> There are three types of node: leaves, interior nodes (which need not be present), and the root.

> (tree) There are various two-way connections ("wires") between nodes as follows: roots have a
left and right wire; interior nodes have a left, right, and parent wire, and leaf nodes have a
parent wire. A parent wire must be connected to either a left or a right wire, and vice versa.
Node A (resp. B) is an ancestor (resp. descendant) of the other if there is a path from it to
the other using only left or right -> parent (resp. parent -> left or right) wires. If the first
wire on the path to a descendant is a left (resp. right) wire the descendant is a left (resp. right)
descendant.

> (numbered leaves) The leaves are indexed by a totally ordered index set ("numbered") so that
the index of one leaf must be less than the index of a second leaf if there is a common ancestor
for which the first leaf is a left descendant and the second a right descendant.

> (homogeneous) All nodes of one type run the same program. Programs are allowed to do
reasonable forms of computation and to try to send and receive information on the wires.

> (singly buffered) If a program tries to receive information over a given wire it will do nothing
else until the program of the node at the other end of the wire tries to send. If a program tries
to send on a wire twice without the other program having tried to receive, the sending program
will do nothing else until the other program tries to receive. Programs may perform closure
application with no regard to these restrictions, but the transmission of the closures must have
obeyed these conditions. Programs may test whether a line has or can accept data and therefore
avoid waiting if it can't. The situation where neither program at either end of the wire can send
or receive is possible, but only for a bounded amount of time.
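
The "singly buffered" condition can be illustrated with a small model. The following is only a
sketch under stated assumptions: the class name Wire and its method names are hypothetical, a wire
is taken to be one-directional, and the points where the definition says a program "will do
nothing else" are modelled by raising an error rather than by actually waiting.

```python
class Wire:
    """One-directional, singly buffered wire (hypothetical model).

    At most one unreceived item may be held: a second send has to wait
    until the first item has been received.
    """

    _EMPTY = object()  # sentinel meaning "nothing in the buffer"

    def __init__(self):
        self._slot = Wire._EMPTY

    def sendable(self):
        """True if a send would not have to wait."""
        return self._slot is Wire._EMPTY

    def readable(self):
        """True if a receive would not have to wait."""
        return self._slot is not Wire._EMPTY

    def send(self, item):
        # A real node would wait here; the sketch reports the would-be wait.
        if not self.sendable():
            raise RuntimeError("sender would wait: previous item not yet received")
        self._slot = item

    def receive(self):
        if not self.readable():
            raise RuntimeError("receiver would wait: nothing has been sent")
        item, self._slot = self._slot, Wire._EMPTY
        return item
```

The sendable and readable tests correspond to the clause that lets a program test a line and
avoid waiting.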

We need a definition of a tree parallel structure with upward communication only:

Definition 1.4. An upward tree parallel structure is a tree parallel structure in which no
communication is specified from any left or right end of a wire to the corresponding parent end.
Closure application does not count as a communication.

This is a formal definition of the objects described by TREES statements, and in the rest of
this Subsection we will explore some of the implications of this definition. In particular we are
interested in an assertion that limiting communication to an upwards direction but allowing
closures gives the same expressive power as allowing communication in both directions but not
using closures.

First  we  need  a  lemma. 

Lemma 1.5. Suppose we have two processors A and B with two wires ab1 and ab2 from A to B.
These wires obey the "singly buffered" condition above. It is possible to simulate those two wires
with a single wire with no more than a constant factor speed loss.

Proof: Replace the wire. Replace occurrences of read(ab1, x) (resp. ab2) in B with the fragment
while undefined(vab1) do check(); od; x ← vab1; vab1 ← undefined. Replace readable(ab1) with
defined(vab1). In A, replace send(ab1, x) with while defined(vab1) do check(); od; vab1 ← x; and
sendable(ab1) with undefined(vab1).

Insert "check()" sufficiently often to guarantee execution periodically, with a period short
compared to the time it takes to communicate between processors. The check() call in A checks
whether vab1 and vab2 are defined. If either is defined, say vab1, check() sends the pair (1, vab1)
over the wire and does vab1 ← undefined. The check call in B is a finite state machine. In its
initial state it checks whether there is anything to read on the wire; if there is, it reads it. This
should be a number i; the FSM enters a state S_i. If check() is in S_i, then it will check whether
vabi is empty and only if so it will read the next object from the wire and enter the initial state.

Enumeration of the sequences of actions on the two wires, actual and simulated, serves to establish
correctness. That there is only a constant-factor slowdown can be derived from the fact that
check() does a constant amount of work unless it waits, that it only waits if (and as long as) the
simulated machine would have waited, and that it replaces each communication with a constant
number (two) of communications. ∎
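
The construction in this proof can be sketched as a small simulation. This is an illustrative
model only, not the author's code: the class names Wire, SenderA, and ReceiverB are hypothetical,
the pair (i, value) is sent as two consecutive items (the "two communications" of the proof), and
the places where a real program would loop on check() are modelled by assertions.

```python
UNDEFINED = object()  # sentinel standing for "undefined"


class Wire:
    """Single-slot physical wire (hypothetical model of 'singly buffered')."""

    def __init__(self):
        self.slot = UNDEFINED

    def sendable(self):
        return self.slot is UNDEFINED

    def readable(self):
        return self.slot is not UNDEFINED

    def send(self, item):
        assert self.sendable()
        self.slot = item

    def read(self):
        assert self.readable()
        item, self.slot = self.slot, UNDEFINED
        return item


class SenderA:
    """Processor A's side: the buffers vab1, vab2 and its check() routine."""

    def __init__(self, wire):
        self.wire = wire
        self.vab = {1: UNDEFINED, 2: UNDEFINED}
        self.pending = UNDEFINED  # value whose wire number has already gone out

    def sendable(self, i):
        return self.vab[i] is UNDEFINED

    def send(self, i, x):
        assert self.sendable(i)  # the real program would loop on check() here
        self.vab[i] = x

    def check(self):
        if not self.wire.sendable():
            return
        if self.pending is not UNDEFINED:
            self.wire.send(self.pending)  # second half of the pair: the value
            self.pending = UNDEFINED
            return
        for i in (1, 2):
            if self.vab[i] is not UNDEFINED:
                self.wire.send(i)  # first half of the pair: which simulated wire
                self.pending, self.vab[i] = self.vab[i], UNDEFINED
                return


class ReceiverB:
    """Processor B's side: check() is the finite state machine of the proof."""

    def __init__(self, wire):
        self.wire = wire
        self.vab = {1: UNDEFINED, 2: UNDEFINED}
        self.state = None  # None is the initial state; i means state S_i

    def readable(self, i):
        return self.vab[i] is not UNDEFINED

    def read(self, i):
        assert self.readable(i)  # the real program would loop on check() here
        x, self.vab[i] = self.vab[i], UNDEFINED
        return x

    def check(self):
        if not self.wire.readable():
            return
        if self.state is None:
            self.state = self.wire.read()  # the number i; enter S_i
        elif self.vab[self.state] is UNDEFINED:
            self.vab[self.state] = self.wire.read()  # the value for wire i
            self.state = None
```

For example, after A sends on both simulated wires and the two check() routines have each run
four times, B can read the two values in either order.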

Now we can prove a fundamental theorem about unidirectional communication in a tree.


Theorem 1.6. Suppose we have a tree parallel structure T without transmission of closures.
Then it is possible to perform the same computation that T performs on an upward tree parallel
structure.

Proof: Simulate a second wire from each child to its parent per the previous lemma. Call
that wire $C_l$ ($C_r$) where it impinges on the parent and $C_p$ where it impinges on the child.

The nodes' programs must be modified as follows: All parts of the program must remain
unchanged except for downward communications, which consist of sending statements of the
form (1) write(left, x) and (2) sendable(left) (or right, of course), and receiving statements of
the form (3) read(parent, x) and (4) readable(parent). These four forms are directly translated
as follows: (1) ≡ read($C_l$, C); C(x), (2) ≡ readable($C_l$), (3) ≡ while undefined(v) do check(); od;
x ← v; v ← undefined; send($C_p$, $\lambda z[v \leftarrow z]$), and (4) ≡ defined(v). check() is from the
previous lemma.

Additionally prepend "send($C_p$, $\lambda z[v \leftarrow z]$)" to former recipients' programs and append
"read($C_l$, C); C(x)" to former senders'.

That this causes correct information to be seen in the recipient is evident from the observation
that each closure is used to send exactly one value to the recipient, exactly that value is used
as an argument to the closure as was previously being sent, and it is only used once (and
immediately rendered undefined). That this causes the programs to "hang" at exactly the right
times can be easily seen from the fact that there is a closure in (say) $C_l$ exactly when the recipient
would have been receptive, and there is a value in v exactly when there would have been a value
available. ∎
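
The translation used in this proof can be sketched as follows. This is a minimal model under
stated assumptions, not the author's implementation: the class names Child and Parent are
hypothetical, a deque stands in for the simulated upward wire from $C_p$ to $C_l$, None plays the
role of "undefined" (so None itself cannot be transmitted), and a wait is modelled by an assertion.

```python
from collections import deque


class Child:
    """Former recipient: it now only ever sends closures upward."""

    def __init__(self):
        self.v = None      # None plays the role of 'undefined'
        self.up = deque()  # stands in for the simulated wire C_p -> C_l

    def offer_receiver(self):
        """send(C_p, lambda z[v <- z]): ship a one-shot closure upward."""
        def store(z):
            self.v = z
        self.up.append(store)

    def readable_parent(self):
        """Translation (4) of readable(parent): defined(v)."""
        return self.v is not None

    def read_parent(self):
        """Translation (3) of read(parent, x)."""
        assert self.readable_parent()  # the real program would loop on check()
        x, self.v = self.v, None
        self.offer_receiver()          # make the next downward value possible
        return x


class Parent:
    """Former sender: downward sends become closure applications."""

    def __init__(self, left):
        self.left = left

    def sendable_left(self):
        """Translation (2) of sendable(left): readable(C_l)."""
        return bool(self.left.up)

    def write_left(self, x):
        """Translation (1) of write(left, x): read(C_l, C); C(x)."""
        closure = self.left.up.popleft()
        closure(x)
```

The only data that crosses the wire is the closure, and it travels upward; the value reaches the
child as a side effect of applying that closure.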

The key point to note is that all downward communication is expressed as closure application.
This suggests that it will be possible to express a problem that apparently can not be solved
by divide & conquer as the corresponding problem of creating, in the root, a closure that has a
desired result when applied.

We have therefore shown that we do not surrender any expressive power when we limit tree
declarations to upward communication.


Chapter 2


Examples  of  the  Use  of  Closures 


§2.1  The  Handshake  Problem 

Suppose we have a pipeline of information. Data are supplied at one end of a chain of
processors, processed by every intermediate processor (perhaps in combination with data flowing the
other way), and the results are either extracted at the other end or developed in some of the
intermediate processors. An example of a problem that can be easily solved with a parallel
structure of this sort is convolution, where the specification $\forall A, B\, \exists A' [\forall i [\ldots]]$
must be met. This can be accomplished by a row of processors, each responsible for computing
one element of $A'$, and a regimen in which the A-values flow one way, $a_1$ first, and the B-values
flow the other way, $b_n$ first.

To perform the synchronisation using closures, we would have to state that the I/O processors
at each end provide a closure that can be used to obtain the next datum.

There are two possible ways that use of a closure can result in data coming to be available to the
point of use. Either the value can be returned as the result of the application, or the application
can cause the datum to be sent separately by the closure's host.

We prefer the latter. We like closures not to return values, as we would have to invent a syntax
to allow other computations to proceed while awaiting an answer. Expressive power is not lost
in forbidding closures to return a value, because one can instead have the value returned as a
separate communication. Our prime purpose in setting up parallel structures is to allow different
processors to do different but related work simultaneously, and this would be compromised by
this restriction. I will assume this convention in what follows. It is clear that the difference is
one of convenience and not a fundamental one. It does, however, allow for such features as a
natural method for having a value be the result of the application of more than one closure.
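
The preferred convention can be illustrated with a short sketch. Everything named here is an
assumption made for illustration: the class IOProcessor, the method next_datum_closure, and the
deliver callback that stands in for "the datum is sent separately by the closure's host".

```python
class IOProcessor:
    """Hypothetical I/O processor at one end of the pipeline."""

    def __init__(self, data, deliver):
        self._data = iter(data)
        self._deliver = deliver  # how the host sends a datum to its neighbour

    def next_datum_closure(self):
        """Provide the closure that a neighbour uses to obtain the next datum."""
        def request_next():
            # Returns nothing: the datum travels as a separate communication,
            # so the caller is free to continue with other work.
            self._deliver(next(self._data))
        return request_next
```

A neighbour that holds the closure applies it and later finds the datum in whatever place the
host delivers to, for instance a list acting as an inbox.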

Frequently a link will be used for more than one value. If it is convenient to use a closure to get
succeeding values, there will be many applications of closures. There are two ways to manage
this: either a single closure can be invoked several times, or use of a closure can cause a new
closure as well as the next value to be sent.

The obvious apparent disadvantage of the latter, that it would seem to require the transmission
of extra, useless data, is not real. When the use of a closure causes it to become dead but causes
another one to be sent, we have the situation where only one instance of a given class of closure
can be live at one time. In this situation the closure need not be sent.

If a closure is invoked repeatedly, this causes a problem in determining when a closure is dead.
This problem is not unique to this circumstance, however; there can arise a case in which it is
not known whether a closure will be used even once.


§2.2 Divide-and-Conquer with Closures

In this Section we will consider the broadcast problem, the prefix summation problem, and a
part of one solution to the connected components problem that is amenable to tree solution.


§§2.2.1 Broadcast

In the broadcast problem, a value or values known in a central location are distributed to many
locations. The broadcast problem can be described formally as $\forall i [a'_i \leftarrow F(a_i, x)]$ or perhaps
$\forall j [\forall i [a'_{ij} \leftarrow F(a_{ij}, x_j)]]$. One method of synthesising solutions to this problem might be to
recognise it as a distinct pattern and carry a synthesis rule that produces a broadcast tree when
supplied an instance of a broadcast problem. Another solution is to produce a chain of processors
as a bucket brigade to distribute the information, and then to successively split the chain in
half, but this has the problem that the synthesis process is iterated a variable number of times.
With the new mechanism of closure passing, it is possible to provide more general rules that
handle broadcast problems as a special case without multiple reformulations.

Consider the application of divide & conquer. We want to produce a closure that, when applied
to $x_j$, performs $\forall i [a'_{ij} \leftarrow F(a_{ij}, x_j)]$. We hypothesise that to solve the problem for a whole
subarray we can solve the problem for each of two pieces of the subarray and combine the two
solutions in some manner. Giving the names $f_l$ and $f_r$ to the closures for the left and right
halves of the problem and $f_w$ to that for solving the whole problem, we then show how to
combine closures $f_l = \lambda x [\forall i \in l\, [a'_i \leftarrow F(a_i, x)]]$ and $f_r = \lambda x [\forall i \in r\, [a'_i \leftarrow F(a_i, x)]]$
to create $f_w = \lambda y [f_l(y) \parallel f_r(y)]$. We go through the following sequence:

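    $\forall A, x\, \exists A' [\forall i \in [1 \ldots n] [a'_i = F(a_i, x)]]$

    $\equiv \exists C [\mathrm{action}(C(x)) \equiv \forall A, x [\forall i \in [1 \ldots n] [a'_i = F(a_i, x)]]]$  (abstraction)

    hypothesis: $\equiv \exists C_l^u [\mathrm{action}(C_l^u(x)) \equiv \forall A, x [\forall i \in [l \ldots u] [a'_i = F(a_i, x)]]$  (division)

    $\wedge\; C_l^u(x) \equiv G(C_l^{u'}(G_l(x)), C_{u'+1}^u(G_r(x)))]$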


The abstraction step is the step of asserting that there is a function whose application brings
about the FOL expression that is being abstracted. The division step is the step of asserting that
it is possible to build a closure that solves a large problem, given closures that solve subproblems
(and possibly other data).

This can be satisfied by setting $G(C_1(x), C_2(y)) = C_1(x) \parallel C_2(y)$ (concurrent composition) and
$G_l(x) = G_r(x) = x$.

It only remains to describe the procedure for handling a singleton array. This is the closure

    $\lambda x [a'_i \leftarrow F(a_i, x)]$.
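
The derived broadcast rules can be sketched in a few lines. This is an illustration under stated
assumptions rather than synthesised output: the function names leaf_closure, combine, and build
are hypothetical, arrays are zero-indexed Python lists, and the concurrent composition $\parallel$
is replaced by sequential application because the two halves touch disjoint elements.

```python
def leaf_closure(a, out, i, F):
    """Singleton case: the closure lambda x [a'_i <- F(a_i, x)]."""
    def apply(x):
        out[i] = F(a[i], x)
    return apply


def combine(f_l, f_r):
    """G: build f_w = lambda y [f_l(y) || f_r(y)] in constant time."""
    def f_w(y):
        f_l(y)  # sequential stand-in for the concurrent composition
        f_r(y)
    return f_w


def build(a, out, lo, hi, F):
    """Closure for the subarray a[lo:hi] (hi exclusive, hi > lo)."""
    if hi - lo == 1:
        return leaf_closure(a, out, lo, F)
    mid = (lo + hi) // 2
    return combine(build(a, out, lo, mid, F), build(a, out, mid, hi, F))
```

Applying the top-level closure to the broadcast value then updates every element; here $G_l$ and
$G_r$ are the identity, so the same argument is passed unchanged to both halves.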


The computation of the top level closure is O(log n) where n is the size of the problem. This
is clear from the reasoning of Section 1.2 and from the observation that time(G) = O(1). (G
is creation of a closure enclosing two given closures.) Similarly, the time consumed by an
application of the top closure will be O(log n) from the fact that max(time($G_l$), time($G_r$)) = O(1).
($G_l$ and $G_r$ are identity operations.)


§§2.2.2 Parallel Prefix


2.2.2.1 Overview


To use the closure technique on a given specification, reformulate the problem from something
like $\forall X, \ldots \exists Y [P(X, Y)]$ to $\exists C [\forall X, \ldots \mathrm{action}(C()) \equiv P(X, Y)]$. Heuristically, the problem is
reformulated from that of satisfying a specific input/output specification to that of producing a
closure that, when applied, will cause the I/O specification¹ to be satisfied.

We will need to define "augmented prefix summation with augend $\alpha$" as
$\forall 1 \le i \le n\, [a'_i = \alpha + \sum_{j=1}^{i} a_j]$. The task is to deliver to the root of the tree a closure that will
perform augmented prefix summation. To create a closure that will perform augmented prefix
summation with augend $\alpha$ on a non-trivial vector, divide it into two halves, get such a closure
from each half together with the grand total of the input values for that half, invoke the left
half's closure with $\alpha$ as an augend and the right half's with $\alpha$ + the left half's sum. We deliver
to each node of the tree closures that will perform augmented prefix summation on the vector
comprising its leaves, together with the leaves' sum. Note that the closure delivered to each
node's parent has to include the left subtree's sum, which is available now but won't be later.
A more formal description follows.
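
The scheme just described can be sketched directly. This is a minimal sequential model under
stated assumptions, not the synthesised tree program: the function names build and prefix_sums
are hypothetical, the recursion stands in for the tree of processors, and the two sub-closures are
applied one after the other rather than concurrently.

```python
def build(a, out, lo, hi):
    """Return (closure, total) for the subvector a[lo:hi], hi exclusive, hi > lo.

    Applying the closure to an augend alpha performs augmented prefix
    summation on that subvector, writing the results into out.
    """
    if hi - lo == 1:
        def leaf(alpha):
            out[lo] = alpha + a[lo]
        return leaf, a[lo]
    mid = (lo + hi) // 2
    c_l, v_l = build(a, out, lo, mid)
    c_r, v_r = build(a, out, mid, hi)

    def node(alpha):
        c_l(alpha)        # left half keeps the augend
        c_r(alpha + v_l)  # right half's augend includes the left half's sum

    # v_l is captured by the closure now; it will not be sent again later.
    return node, v_l + v_r


def prefix_sums(a):
    """Build the top-level closure and apply it with augend 0."""
    out = [None] * len(a)
    if a:
        closure, _total = build(a, out, 0, len(a))
        closure(0)
    return out
```

The closure handed upward by each node already contains the left subtree's sum, which is the
point made in the last sentence of the paragraph above.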

Assume that a vector is divided into $A[l \ldots u']$ and $A[u'+1 \ldots u]$. Further assume that we are
trying to compute $F(A[l \ldots u])$ which we will denote $F_l^u$. Further assume that we want to have
some effects, local to the array elements. We would therefore want to compute a closure, $C_l^u$,
that would have the desired effect.


The generic combination operator for the values is $F_l^u = G(F_l^{u'}, F_{u'+1}^u, l, u, u')$ and it is
a synthesis task to derive the properties of G. Similarly, $C_l^u \equiv G(C_l^{u'}, C_{u'+1}^u, l, u, u')$.
If the closure has an argument the situation is slightly more complex; we have
$C_l^u(x) \equiv G(C_l^{u'}(G_l(x, F_l^{u'}, F_{u'+1}^u, l, u, u')), C_{u'+1}^u(G_r(x, F_l^{u'}, F_{u'+1}^u, l, u, u')), l, u, u')$
where the F vectors are the values available to (and incorporated in) $C_l^u$. This general schema
need only be used with specific combiners (i.e., G, $G_l$, etc.). As a simple example, prefix
summation can be performed by this schema if $G \equiv (C_l \parallel C_r)$ (where $\parallel$ is concurrent
application), $G_l(x) = x$, and $G_r(x) = x + v_l$. $v$, in turn, is computed as $v_l + v_r$. Singleton v- and
C-expressions are $C_i \equiv \lambda x [a'_i \leftarrow a_i + x]$ and $v_i = a_i$.


2.2.2.2 Derivation

In this problem, the specification to meet is $\forall i \in [1 \ldots n] [a'_i \leftarrow \sum_{j=1}^{i} a_j]$. I will introduce
the abbreviation $\Sigma_l^u \equiv \sum_{j=l}^{u} a_j$. This then becomes $\forall i \in [1 \ldots n] [a'_i \leftarrow \Sigma_1^i]$. We change the
specification to one requiring the computation of a closure which, when applied to no arguments,
performs this action; together with the application of that closure.

¹ More precisely, the problem of satisfying the I/O specification that requires no input and produces
the closure.


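    $\forall A\, \exists A' [\forall i \in [1 \ldots n] [a'_i = \Sigma_1^i]]$

    $\equiv \exists C\, \forall A [\mathrm{action}(C()) \equiv [\forall i \in [1 \ldots n] [a'_i = \Sigma_1^i]]]$  (abstraction)

    hypothesis: $\equiv \exists C_l^u\, \forall A [\mathrm{action}(C_l^u()) \equiv \forall i \in [l \ldots u] [a'_i = \Sigma_l^i]$  (division)

    $\wedge\; C_l^u \equiv G(C_l^{u'}(), C_{u'+1}^u())]$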


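But $\mathrm{action}(C_{u'+1}^u()) \equiv \forall i \in [u'+1 \ldots u] [a'_i = \Sigma_{u'+1}^i]$ so this is impossible. $C_{u'+1}^u$ must be
provided with a parameter to be able to do this.

We modify the closures so instead of $\mathrm{action}(C_l^u()) \equiv \ldots$ we have
$\mathrm{action}(C_l^u(x)) \equiv \forall i \in [l \ldots u] [a'_i = H(\Sigma_l^i, x, i)]$. We do not yet know the properties of H.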

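We now have:

    $\mathrm{action}(C_1^n(x)) \equiv \forall i \in [1 \ldots n] [a'_i = H(\Sigma_1^i, x, i)]$

    $\mathrm{action}(C_l^{u'}(x)) \equiv \forall i \in [l \ldots u'] [a'_i = H(\Sigma_l^i, x, i)]$

    $\mathrm{action}(C_{u'+1}^u(x)) \equiv \forall i \in [u'+1 \ldots u] [a'_i = H(\Sigma_{u'+1}^i, x, i)]$

So we observe:

    $\mathrm{action}(C_l^u(x))$

    $\equiv (\forall i \in [l \ldots u] [a'_i = H(\Sigma_l^i, x, i)])$  (above)

    $\equiv \mathrm{action}(G(C_l^{u'}(G_l(x)), C_{u'+1}^u(G_r(x))))$  (D&C)

    $\equiv G((\forall i \in [l \ldots u'] [a'_i = H(\Sigma_l^i, G_l(x), i)]), (\forall i \in [u'+1 \ldots u] [a'_i = H(\Sigma_{u'+1}^i, G_r(x), i)]))$  (expansion)

    $\equiv \forall i \in [l \ldots u'] [a'_i = H(\Sigma_l^i, x, i)] \wedge \forall i \in [u'+1 \ldots u] [a'_i = H(\Sigma_l^i, x, i)]$  ($\forall$ identity)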


Assuming G merely generates a closure to produce application of both of its parameters, then
$H(\Sigma_l^i, x, i) = H(\Sigma_l^i, G_l(x), i)$ and $H(\Sigma_l^i, x, i) = H(\Sigma_{u'+1}^i, G_r(x), i)$. The first unifies to $x = G_l(x)$.


The second needs a bit more attention. If we represent $\Sigma_l^i = \Sigma_l^{u'} + \Sigma_{u'+1}^i$, then
$H(\Sigma_l^i, x, i) = H(\Sigma_l^{u'} + \Sigma_{u'+1}^i, x, i)$, so $H(\sigma + \tau, x, i) = H(\sigma, G_r(x), i)$ where
$\sigma = \Sigma_{u'+1}^i$ and $\tau = \Sigma_l^{u'}$.

Letting $H = \lambda x, y [x + y]$ we get $G_r = \lambda x [x + \tau]$. This leads to another problem, that there isn't
enough information around to compute $G_r$. We have to expand the problem again to bring
about the availability of intermediate values for the intermediate closures. In this case we need
$\Sigma_l^{u'}$. Instead of


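    $\mathrm{action}(C_l^u(x)) \equiv \mathrm{action}(G(C_l^{u'}(G_l(x)), C_{u'+1}^u(G_r(x))))$

we want

    $v_l^u = H(v_l^{u'}, v_{u'+1}^u)$

and

    $\mathrm{action}(C_l^u(x)) \equiv \mathrm{action}(G(C_l^{u'}(G_l(v_l^{u'}, v_{u'+1}^u, x)), C_{u'+1}^u(G_r(v_l^{u'}, v_{u'+1}^u, x))))$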


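Taking a more intuitive view for the moment, we observe that we want to compute a two-tuple
$(v_l^u, C_l^u)$ in which $v_l^u = \Sigma_l^u$ and in which $\mathrm{action}(C_l^u(x))$ is the computation of an augmented
prefix summation, where $a'_i \leftarrow x + \Sigma_l^i$ instead of $a'_i \leftarrow \Sigma_l^i$.

We want $G_r(v_l^{u'}, x) = x + \Sigma_l^{u'}$, so we use $v_l^u = \Sigma_l^u$, or $v_l^u = v_l^{u'} + v_{u'+1}^u$.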

We lack only one step to a complete solution. Initially we wanted to compute a closure which,
when computed for that "sub-array" which is the whole array and applied to no argument,
computes the prefix sum. We will get, instead, a pair of results. One of the results is a value,
and the other is a closure which, when applied to one value, computes a generalization of the
prefix sum. It remains to convert this back into a closure that can be applied to no arguments.

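We have

    $\mathrm{action}(C_l^u(x)) \equiv \forall i \in [l \ldots u] [a'_i = \Sigma_l^i + x]$

and we want

    $\mathrm{action}(C()) \equiv \forall i \in [1 \ldots n] [a'_i = \Sigma_1^i] \equiv \mathrm{action}(C_1^n(x)) \equiv \forall i \in [1 \ldots n] [a'_i = \Sigma_1^i + x]$

for some $x$. Clearly $x = 0$ works.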
Summarizing,  we  have  all  of  the  following: 


    $\mathrm{action}(C()) \equiv \forall\, 1 \le i \le n\, [a'_i = \Sigma_1^i]$

    $\mathrm{action}(C_l^u()) \equiv \forall\, l \le i \le u\, [a'_i = \Sigma_l^i]$

    $\equiv \forall\, l \le i \le u'\, [a'_i = \Sigma_l^i] \wedge \forall\, u'+1 \le i \le u\, [a'_i = \Sigma_l^i]$

    $\equiv \forall\, l \le i \le u'\, [a'_i = \Sigma_l^i] \wedge \forall\, u'+1 \le i \le u\, [a'_i = \Sigma_l^{u'} + \Sigma_{u'+1}^i]$

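We must supply a new parameter:

    $\mathrm{action}(C(x_0)) \equiv \forall\, 1 \le i \le n\, [a'_i = H(\Sigma_1^i, x_0)]$

    $\mathrm{action}(C_l^u(x)) \equiv \mathrm{action}(C_l^{u'}(G_l(x)), C_{u'+1}^u(G_r(x)))$

$H(\Sigma_l^i, x, i) = H(\Sigma_l^{u'} + \Sigma_{u'+1}^i, x, i) = H(\Sigma_{u'+1}^i, G_r(x), i)$ works if $H(x, y) = x + y$ and
$G_r(x) = \Sigma_l^{u'} + x$, but the latter requires having $\Sigma_l^{u'}$ available. We therefore further modify
the problem by requiring the collection of another value.

    $v_l^u = \Sigma_l^u$

    $= H(v_l^{u'}, v_{u'+1}^u)$

    $= H(\Sigma_l^{u'}, \Sigma_{u'+1}^u)$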

The last observations we need (the base case) are:

    $C_i^i = \lambda x [\forall\, i \le j \le i\, [a'_j = x + \Sigma_i^j]] = \lambda x [a'_i \leftarrow x + a_i]$

We therefore have $H(x, y) = x + y$ making $v_l^u = v_l^{u'} + v_{u'+1}^u$. $C_l^u(x)$ applies $C_l^{u'}$ to $x$, and
$C_{u'+1}^u$ to $x + v_l^{u'}$. Creating new symbols for the values ($v_l$, $v_r$ and $v$) and closures ($C_l$, $C_r$, and $C$)
received from the subproblems and passed to the superproblem, we finally get the following:
    $H(x, y) = x + y$

    $v = v_l + v_r$
    $v.\mathit{leaf}_i = a_i$

    $G(C_l, C_r) = C_l(G_l(x)) \parallel C_r(G_r(x))$
    $G_l(x) = x$
    $G_r(x) = x + v_l$

    $C.\mathit{leaf}_i = \lambda x [a'_i = x + a_i]$
    $C.\mathit{root} = \lambda() [C(0)]$
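
As a worked instance of these rules, consider the four-element input $A = (3, 1, 4, 1)$. The input
vector and the node names $v_{12}$, $v_{34}$ (the values held by the two interior nodes) are
assumptions made for illustration only.

$$
\begin{aligned}
&\text{upward values: } v.\mathit{leaf} = 3, 1, 4, 1; \quad v_{12} = 3 + 1 = 4, \quad v_{34} = 4 + 1 = 5; \quad v.\mathit{root} = 4 + 5 = 9 \\
&\text{root applies } C(0): \quad G_l(0) = 0, \quad G_r(0) = 0 + v_{12} = 4 \\
&\text{left interior node, } x = 0: \quad a'_1 = 0 + 3 = 3, \quad a'_2 = (0 + 3) + 1 = 4 \\
&\text{right interior node, } x = 4: \quad a'_3 = 4 + 4 = 8, \quad a'_4 = (4 + 4) + 1 = 9
\end{aligned}
$$

The results $(3, 4, 8, 9)$ are the prefix sums of $A$, and every downward step is a closure
application with an argument computed from $x$ and a stored left-hand sum.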


This can be converted to a decorated tree structure by simple rewrite rules.

For example, we have $G(C_l, C_r) = C_l(G_l(x)) \parallel C_r(G_r(x))$. We would therefore have a synthesized
TREE declaration to read, in part,


    inter HAS C, v
          HEARS leftson (USES C as C_l, USES v as v_l)
          HEARS rightson (USES C as C_r, USES v as v_r)
          TALKS parent (SENDS C, SENDS v)


and the program for the internal nodes to read, in part,
(in T.inter):

    C ← λx[C_l(G_l(x)) ∥ C_r(G_r(x))]
        where G_l(x) = x
        where G_r(x) = x + v_l

    v ← v_l + v_r


§§2.2.3 Connected Components

The problem is to find the connected components of a graph, given an adjacency matrix (a
matrix A in which $a_{ij}$ = true iff node i is (directly) connected to node j in the graph). The
adjacency matrix will be available for input one row at a time, and a solution is better that
reads the rows at constant intervals.

In this Subsection we will derive a tree structure that solves part of the problem and meets
certain worst case time constraints. The derived structure will operate while the rows of the
adjacency matrix are read in.

Formally, we will assume that there exists a source of rows of the adjacency matrix that can
provide one row at a time. Each column will be read by its own processor. Columns and rows
have integers in the range [1, 2, ..., n] as names. When column i's processor reads row j it
receives the value true if there is a graph edge between i and j or false otherwise. The network
we derive will then store the information in such a manner that it or some other network can
identify connected components of the graph whose adjacency matrix was read. The identification
process is not the issue here.

The column processor nodes of the network must read elements of the rows of the adjacency
matrix at such a time (in relation to the time other processors read their elements of the same
row) that the network will not confuse elements of different rows of the matrix, and the net must
build a representation of the (partial) connected components information in some useful
manner. The representation should be compact and the computation should be fast.

First  we  will  derive  the  structure  up  to  one  important  implementation  decision;  then  we  will 
describe  the  two  resulting  parallel  structures. 


2.2.3.1 Derivation of a Tree Structure

In the connected components problem, we do not necessarily want to change the state of the
leaves of the tree or develop a value at the root. Instead, we want to change some state so
questions about connected components become easier to answer.

We will use the notation CC(i) to denote the set of nodes in the same connected component
as the node i. CC'(N) is a predicate indicating whether all nodes of N, a set of nodes, are in
a single connected component. Since the state of knowledge of the connected components of a
graph can vary with time and, in a multiprocessor system, with location, we will later introduce
other variants of the CC predicate.

We will read the rows of the adjacency matrix one by one. After we have read all of the rows
we will then engage in another computation, not described here, to put $\mathrm{reduce}_{\min}\{j \mid j \in CC(i)\}$
in leaf i. In what follows we will call the processing that takes place between the reading of
consecutive rows of the matrix a phase.

There are several solutions to the connected components problem which we reject because they
have certain undesirable features. One solution, for example, would be to have each node record
the row numbers of all rows of the adjacency matrix in which it is mentioned. This would require
a lot of storage. Another solution is to have each leaf, after each row, find $\mathrm{reduce}_{\min}\{j \mid j \in CC(i)\}$
so far. This solution has the problem that the time between the reading of rows can vary over
a wide range.

Our derivation requires a certain amount of invention. We will assume that the user provides
this by defining several intermediate predicates and by providing some information. First, the
idea of a map to store the state of the connected components so far, and then the idea that the
map is limited, have to be conceived.

We start with axioms about connected components:

    $CC'(\{e\})$

    $CC'(\emptyset)$

    $CC'(A) \wedge CC'(B) \wedge A \cap B \ne \emptyset \Rightarrow CC'(A \cup B)$

    $CC'(A) \wedge A' \subseteq A \Rightarrow CC'(A')$
We observe that the following is trivially true:

    $CC'(A) \wedge CC'(B) \wedge \exists a, b [a \in A \wedge b \in B \wedge CC'(\{a, b\})] \Rightarrow CC'(A \cup B)$

First,  we  supply  TransConS  with  a  divide-and-conquer  formulation. 

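    $\forall V, W \in V\; CC'(W)$

where

    $CC'(W) \equiv |W| \le 1$

    $\vee\; W = W_l \uplus W_r$

    $\wedge\; CC'(W_l)$

    $\wedge\; CC'(W_r)$

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow CC'(\{\mathrm{arb}(W_l), \mathrm{arb}(W_r)\}))$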

TransConS can easily check that this meets the axioms, but the combination of the two
halves by a pair of arbitrary elements, one from each half, constitutes a user-supplied invention.

TransConS observes that the current state of CC' is represented by the choices of pairs
of arbitrary elements, and introduces M to carry this information. Since M represents the
state of knowledge of connected components, we will define a new binary predicate CC(M, X)
which denotes that the mapping M asserts that there exists a connected component C such
that $X \subseteq C$. Taking a finite difference against the addition of a new set X that is known to be
connected, we get:

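    $\forall X, M\, \exists M' [CC(M', X) \wedge \forall W (CC(M, W) \Rightarrow CC(M', W))$
    $\wedge\; \forall a, b (\neg CC(M, \{a, b\})$
    $\wedge\; \forall Y, Z [CC(M, \{a\} \cup Y) \wedge CC(M, \{b\} \cup Z)$
    $\Rightarrow Y \cap X = \emptyset \vee Z \cap X = \emptyset]$
    $\Rightarrow \neg CC(M', \{a, b\}))]$

where

    $CC(M, W) \equiv |W| \le 1$

    $\vee\; W = W_l \uplus W_r$

    $\wedge\; CC(M, W_l)$

    $\wedge\; CC(M, W_r)$

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r [M(a, b)])$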

The long conjunct on the second through fifth lines states simply that no connected components
are implied by M' that aren't either implied by M or forced by X.

We invite the user to make another critical observation, namely that $\forall W [CC(M, W) \Rightarrow CC(M', W)]$
can be satisfied by $\forall a, b (M(a, b) \Rightarrow M'(a, b))$. (S)he can further observe from the
original axioms that $CC(\{a, b\}) \wedge a \in A \wedge CC(\{b, c\}) \wedge c \in C \Rightarrow CC(A \cup C)$. We can thus
liberalize the condition on M in CC as follows:

    $\forall X, M\, \exists M' [CC(M', X) \wedge M(a, b) \Rightarrow M'(a, b) \wedge \ldots]$

where

    $CC(M, W) \equiv |W| \le 1$

    $\vee\; W = W_l \uplus W_r$

    $\wedge\; CC(M, W_l)$

    $\wedge\; CC(M, W_r)$

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b [M(a, b) \wedge (b \in W_r \vee CC(M, \{b\} \cup W_r))])$

This specification is suboptimal because it allows M to be multivalued. We will examine this
solution in detail and see how it translates into decorated trees that maintain M in internal
state. We will then see what can be done to improve this.

We therefore make a change in CC to express the fact that the divisions will always be made
in the same manner, and that M need only be defined for one set of subsets of the universe.

This change is the addition of a parameter, a subset of the universe (of nodes in the graph
whose connected components we are seeking). Later we will repair another deficiency of this
specification, that it allows M to be larger than we would like.

M will be made a ternary rather than a binary relation. M(S, a, b) is true if a connects to b
relative to S. The purpose of this is to limit the size of M.

A new parameter to CC ranges over particular subsets of the universe. It has two roles: it tells
what version of M to use, and it restricts acceptable solutions to CC. CC(S, M, X) is true only
if there exist elements of M(S', x, y), where $S' \subseteq S$, that show that X is connected. This is a
stronger condition than the original CC(M, W).

To formalise the new parameter of CC we write:

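    $\forall V, X, M, W \in V\, \exists M'\, \forall a, b [CC(M, W) \wedge CC(M', X) \wedge M(S, a, b) \Rightarrow M'(S, a, b)]$

where

    $CC(M, W) \equiv CC(U, M, W)$,

and

    $CC(S, M, W) \equiv |W| \le 1$

    $\vee\; W_l = W \cap L(S)$

    $\wedge\; W_r = W \cap R(S)$

    $\wedge\; CC(L(S), M, W_l)$

    $\wedge\; CC(R(S), M, W_r)$

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r [M(S, a, b)])$

and

    $L(S) \uplus R(S) = S$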

Now we can perform a synthesis by transforming
satisfy($\forall V, X, M, W \in V\, \exists M'\, \forall a, b [CC(M, W) \wedge CC(M', X) \wedge M(S, a, b) \Rightarrow M'(S, a, b)]$).
This works with no problems. We soon find ourselves transforming satisfy($M'(S, a', b')$).
However, this causes no problem, either.

Suppose we add an additional condition, $M(S, a, b) \wedge M(S, a, c) \Rightarrow b = c$. We start with this: (we
have replaced occurrences of M by occurrences of M', as the constraint propagator would do
when analyzing "CC(S, M', W)".)

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r [M'(S, a, b) \wedge \forall c (M'(S, a, c) \Rightarrow c = b)])$


This last clause makes us a bit unhappy, when considered together with the expression
$M(S, a, b) \Rightarrow M'(S, a, b)$.

However, we have $M(S, a, c) \Rightarrow CC(S, \{a, c\})$ and $CC(R(S), \{c\} \cup W_r) \wedge CC(S, \{a, c\}) \Rightarrow CC(S, \{a\} \cup W_r)$.

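We therefore use $\vee$ to expose the fact that there are alternatives:

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset$
    $\Rightarrow \exists a \in W_l, b \in W_r [(M'(S, a, b) \wedge \neg\exists c \ne b [M(S, a, c)])$
    $\vee\; \exists c [M'(S, a, c) \wedge CC(S, \{c\} \cup W_r)]])$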


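As it is known that M(S, a, x) can only be asserted by the above, an inductive proof is available
that $c \in R(S)$. This can therefore be replaced by

    $\wedge\; (W_l \ne \emptyset \wedge W_r \ne \emptyset$
    $\Rightarrow \exists a \in W_l, b \in W_r [(M'(S, a, b) \wedge \neg\exists c \ne b [M(S, a, c)])$
    $\vee\; \exists c [M'(S, a, c) \wedge CC(R(S), \{c\} \cup W_r)]])$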


This gives two alternative ways to satisfy the specification. We can satisfy $M'(S, a, b)$ if
$M(S, a, b) \vee \neg\exists c [M(S, a, c)]$. Satisfying the other disjunct is harder than this because it
requires satisfaction of a predicate containing R(S), so we prefer the first disjunct when it can
be satisfied. If we can't use the first disjunct, then we know $\exists c [M(S, a, c)]$ so we have only to
satisfy $CC(R(S), \{c\} \cup W_r)$ for that c. This leads to:

tatltiy(3o€  W,,J€W, 

[(M'(S,a.i)  A  ^e^blM(S,a.e)}  V  3 c(A/'(5, o, c)  A  CC(«(5),{e}U  W,)])]) 

-»  bind  a  to  arb(W)),  b  to  arb(W,)  in 

if  M'(S,  a,  b)  V  !M(S,  a,  e)]  then  satisfy(A/'(5,  o,  6)) 
else  satlsfy(A/(5,a,e)  =0  CC(R(S),{e>UW;)) 


2.2.3.2  Alternative  Data  Structures 

It is now necessary to consider the options for storing M. The type of M is $T \times U \rightarrow U$,
where U is the set of nodes in the graph whose connected components are being determined, and T
is a set of sets such that $U \in T \wedge (S \in T \wedge |S| > 1 \Rightarrow R(S) \in T \wedge L(S) \in T)$.
The genesis of T is such that each intermediate node plus the root of the tree has as its set of
leaves some element of T if each element of U is represented by a leaf.


2. Examples of the Use of Closures


2.2. Divide-and-Conquer with Closures


Because of the type of M, we have four simple options to represent the mapping: we can
represent it in one processor's memory, in the memory of one processor per element of T, in one
processor per element of U, or in one processor per element of $T \times U$. The first possibility
would lack concurrency and the last would require too many processors. The remaining
possibilities include using interior nodes of the tree (corresponding to elements of T) or leaves
(corresponding to elements of U) as the repository for information about parts of M.

Inspection of the specification yields the information that the tree node representing a set S
must be able to answer questions of the form $\exists c\,[M(S,a,c) \wedge c \neq b]$ and find c such that
M(S,a,c), and must be able to satisfy(M(S,a,b)). This requires either keeping M(S,x,y) in S's node
or providing that node with appropriate closures.

That node must also be able to satisfy($CC(L(S), M', W_l)$), to satisfy($CC(R(S), M', W_r)$), and
to satisfy($CC(R(S), M', \{c\} \cup W_r)$) given $c \in R(S) \wedge CC(R(S), M', W_r)$. This requires
another handful of closures.

Since closures to satisfy($CC(L(S), M', W_l)$) and satisfy($CC(R(S), M', W_r)$) would require only
information available below L(S) and R(S) respectively, and since there is no control flow path
by which the need to satisfy those two predicates would be evaded, we observe that each interior
node requires a = arb $W_l$, b = arb $W_r$, and the closure $\lambda x\,[\text{satisfy}(M'(R(S), a, x))]$.

We are building a map that maps at most one leaf of the right subtree to each leaf of the left
subtree. As described, the map is stored in the node that has the appropriate subtrees. However,
other alternatives are possible.

There are three natural places to store the assertion M(S,a,b). They are the node whose
subtree's leaves are S, leaf a and leaf b. If the information is stored in S, there must be one cell
for each leaf of the left subtree, and if the information is stored in a then there must be one cell
for each ancestor representing S. If the information is stored in b we have no limit (beyond the
size of the problem) for the amount of storage that must be provided in b. We therefore reject
this alternative.

Storing M in the node heading S minimises communication (information is where it is used),
making the algorithm take O(log n) steps. These steps are not constant-time steps because
they require access to a random access memory whose size is O(n), itself an O(log n) operation.*
The algorithm therefore has an $O(\log^2 n)$ running time.

The  result  could  be  transformed  to  place  the  fact  of  M(S,a,b)  in  a.  This  would  result  in  a 
different  algorithm,  one  that  requires  the  leaves  to  supply  closures  to  access  and  modify  the 
map. 

There is an interesting problem here. We would prefer that the leaves not have to know about
elements of T. It would therefore be necessary to have the M table within each leaf organised
in a certain order and to have use made of this information in that fixed order. This requires
that a "flame front" of subtree handling be arranged such that initially the root is the tree for
which you are trying to associate pairs of elements, and on succeeding subphases the level at
which we are trying to match descends. This algorithm has an $O(\log^2 n)$ execution time because
there are lg n subphases, each of which is O(log n).


* The constant factors are such that this is probably not a serious issue. If the problem instance is
large, say > or so, the RAM access time might be slow. However, much of the communication
between the tree's processors would then be off-chip, making interprocessor communication even
slower. If the problem instance is small, the RAMs in each processor would be small enough to make
their access time comparable to ordinary logical elements in the processor. Only for a truly immense
problem instance, say 2**, would the memory access time dominate the communication time.


We prefer the former data structure, in which M(S,a,b) is represented in S, because the issue
described in the previous paragraph does not arise. That structure will always be available to
us unless the size of a change to M is proportional to the size of S, and this cannot be because
the combination step of the divide and conquer scheme must be fast for the specification to
parallelise well in a tree structure.
3.2.3.3  Results of Storing the Map in the Leaves

This subsubsection will discuss the algorithm's response to a single row of input.

The parallel structure is (informally) as follows:

There is a balanced binary tree of processors. The leaves of the tree correspond to the nodes
of the graph, and they are ordered in the order that corresponds to the arrivals of rows of the
adjacency matrix. (This last fact is not important.) For simplicity of exposition we will write
the following as if the leaves were, rather than "corresponded to", the nodes. For simplicity we
will assume that the entire adjacency matrix is supplied, rather than only a triangular matrix.

The leaf nodes build approximations to the answer as the algorithm grinds on. Each leaf node
has one memory cell for each ancestor. Consider the memory cell for ancestor a in leaf i. It
is initialised to the distinguished value nil, and during the course of the algorithm it will come
to contain some j such that LCA(j,i) = a and i and j are known to be in the same connected
component, provided that some such j exists.

The algorithm works as follows: A leaf is called active if its bit is set in the current row of
the adjacency matrix. After a row is read in, information is passed upward so each node can
determine whether both of its subtrees contain active leaves, and what the highest and lowest
active leaves are for such nodes. Information is then passed downward so each internal (or root)
node can determine whether it is the top such node. That node sends a message to those two
extreme nodes informing them of each other's identity.
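
The upward pass described above can be illustrated sequentially. The following is a minimal Python sketch, assuming the leaves are numbered 0 to n-1 with n a power of two and each tree node is identified by its half-open range of leaves; the function name `survey` and this representation are illustrative assumptions and do not come from the original design.

```python
def survey(active, lo, hi):
    """Sequential stand-in for the upward pass over the subtree whose
    leaves are lo .. hi-1 (hypothetical representation of a tree node).

    Returns (span, top):
      span is (lowest active leaf, highest active leaf), or None if the
           subtree contains no active leaf;
      top  is the (lo, hi) range of the highest node in this subtree whose
           two subtrees both contain an active leaf, or None.
    """
    if hi - lo == 1:                      # a leaf
        return ((lo, lo) if active[lo] else None), None
    mid = (lo + hi) // 2
    lspan, ltop = survey(active, lo, mid)
    rspan, rtop = survey(active, mid, hi)
    if lspan is not None and rspan is not None:
        # Both subtrees are active, so this node outranks any candidate below.
        return (lspan[0], rspan[1]), (lo, hi)
    if lspan is not None:
        return lspan, ltop
    if rspan is not None:
        return rspan, rtop
    return None, None
```

For a row whose active leaves are 2 and 5 in an eight-leaf tree, the sketch reports the span (2, 5) and the root as the top node; those two extreme leaves are the ones that would then be told of each other's identity.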

The following cycle is repeated.

TU computes spans; TD distributes span information and keeps track of the topness of nodes.

TU istype TREE(i), $i \in [1, \ldots, n]$, size n

    root   HAS    minact, maxact, topp, listop, ristop
           HEARS  leftson   (uses upmin)
           HEARS  rightson  (uses upmax)
           TALKS  leftson   (sends listop)
           TALKS  rightson  (sends ristop)
    inter  HAS    minact, maxact, topp, listop, ristop
           HEARS  leftson   (uses upmin)
           HEARS  rightson  (uses upmax)
           TALKS  leftson   (sends listop)
           TALKS  rightson  (sends ristop)
           TALKS  parent    (sends upmin)
                            (sends upmax)
    leaf   HAS    $active_i$, $ccmate_{i,j}$, $j \in$ ancestors
           HEARS  INPUT     (uses $adj_{ij}$, $j \in (1, \ldots, n)$)
           TALKS  parent    (sends upmin)
                            (sends upmax)


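(in TU.leaf_i)

    $\forall j \in$ ancestors
        $ccmate_{i,j} \leftarrow$ nil
    $\forall j \in (1, \ldots, n)$
        temp $\leftarrow adj_{ij}$
        upmin $\leftarrow$ upmax $\leftarrow$ if temp then i else nil
        dmin $\leftarrow$ downmin
        dmax $\leftarrow$ downmax
        other $\leftarrow$ nil
        pivot $\leftarrow$ pivot
        if dmin = i then other $\leftarrow$ dmax
        if dmax = i then other $\leftarrow$ dmin
        if other $\neq$ nil then
            if $ccmate_{i,pivot}$ = nil
            then awaken $\leftarrow$ nil; $ccmate_{i,pivot} \leftarrow$ other
            else awaken $\leftarrow ccmate_{i,pivot}$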

(in TU.inter)

    ; first establish my status
    ((lrangel, lrangeh)) $\leftarrow$ lrange
    ((rrangel, rrangeh)) $\leftarrow$ rrange
    range $\leftarrow$ ((min(lrangel, lrangeh), max(rrangel, rrangeh)))
    livep $\leftarrow$ rangel $\wedge$ rangeh
    ; This is a once-per-minor-phase activity
    while dstatus $\neq$ 'dead


(in TU.root)

    ((lrangel, lrangeh)) $\leftarrow$ lrange
    ((rrangel, rrangeh)) $\leftarrow$ rrange
    range $\leftarrow$ ((min(lrangel, lrangeh), max(rrangel, rrangeh)))
    livep $\leftarrow$ rangel $\wedge$ rangeh
    while dstatus $\neq$ 'dead


(in TD.inter)

    if pstatus $\in$ {'live, 'top}
        then status $\leftarrow$ 'live
             range $\leftarrow$ prange
    elseif livep then status $\leftarrow$ 'top
             range $\leftarrow$ range
    else status $\leftarrow$ 'dead
    while status $\neq$ 'dead


(in TD.root)

    if livep then status $\leftarrow$ 'top
             range $\leftarrow$ range
    else status $\leftarrow$ 'dead




Each minor phase the leaves send up awakening info and get back a packet of info very similar
to the one they got in the beginning.

Each leaf, when it dies (finds out that the node just above it is dead), sends up an "init" message.
When every node has done so the root broadcasts its own form of "init" and the leaves read
from the I/O processor that contains the next row of the adjacency matrix.

Here we describe the overall behavior of the algorithm, considering the parallel structure to
be a single entity that can do things sequentially. To actually have this effect, there are
synchronization problems, and below we describe a node's eye view of the situation, including
the work that each node has to do to coordinate with its neighbors.

Initialise:
    Have each node read in its element of the adjacency matrix. Those nodes reading a "1"
    in the adjacency matrix turn themselves on, as does the node whose index corresponds to
    that of the row of the matrix. Mark the root as the "focus".

Survey:
    Every leaf sends information telling whether it is awake. Using this information, the
    internal nodes below a focus find out which of them has awake descendants in each of the
    two trees ("has two active subtrees"). This is a straightforward "up" problem.

New root:
    The highest node with two active subtrees is determined. This is the Least Common
    Ancestor (LCA) of active leaves. It becomes the new focus, nodes between it and leaves
    become "active", and nodes above it but below and including the old focus become "dead".

Tournament:
    Select an arbitrary active leaf node in each of each focus's two subtrees. Report the
    identities of the two leaves to their focus. Simultaneously report the identity of the
    focus and of the other leaf to each of the two leaves.

Lookup:
    The leaves contain a variable mapping, mapping their ancestors into a leaf index or the
    distinguished value nil. The leaves look up the focus in this mapping. If it is nil, they
    store the other leaf's identity. If the left leaf's value is not nil, report the value to
    its focus.

New awakening:
    If its left tree reports a leaf ID per Lookup, a focus sends a message to that leaf
    commanding it to awaken.

Refocus:
    Each focus sends a message to those of its children that are not leaves telling them to
    become new focuses, and dies.

Repeat (maybe):
    If not all leaves have a dead parent, go back to New root.

As can be seen above, the algorithm has several subphases, as the focus moves down towards the
leaves, and each of these subphases has several sub-sub-phases: Survey, New root, Tournament,
Lookup, New awakening, Refocus, and Repeat (maybe). Internal nodes of the tree have the
status dead, focus or live, and leaf nodes either have status awake or asleep. The behavior of
each node during each sub-sub-phase will be described.

Survey: Leaves tell parents whether they are active. Intermediate nodes (live and focus only)
get status from descendants. They remember and (live only) tell their parent how many subtrees
have one or more active subtrees. They remember which subtree was active if exactly one was.

New root: If a focus has two active subtrees it tells its left (resp. right) child "focus above
you=(node), you are left (resp. right)". If it has one, it tells that one "focus at or below you"
and the other "die". It can't have none.


Intermediate nodes below a focus (i.e., those nodes that are live) listen to their parents. If one
hears "die" it dies. If one hears "focus above xxx ..." it relays the message and becomes or
remains live. If one hears "focus at or below" it acts as in the paragraph above.

Leaves that receive a "die" message send their parent an "I died" message and prepare to read
the next line of the adjacency matrix.

Active leaf nodes record the name of their focus.

Tournament and Lookup: Each leaf contains a mapping M relating the name of each of its
ancestors to either nil or the index of a leaf. A sleeping leaf node sends nil to its parent. An
awake leaf node i that receives a "focus above you=(node), you are left" message sends to its
parent either ((empty, i)) if M(node) = nil, or ((loaded, M(node))). If it receives "focus above
you=(node), you are right", it sends i to its parent.

A live internal node which receives nil from both children sends the same to its parent, one
that receives something else from one child sends that value to its parent, and one that receives
non-nil values from both children sends either to its parent. The correctness of the algorithm
does not depend on this choice, which can be random, pseudo-random, or consistent.

Each focus receives a message from each child. Say the right child's message is j. If the left
child's message is ((empty, i)), then ((record, focus, i, j)) is sent to the left child and nil is sent
to the right. If the left child's message is ((loaded, i)), then nil is sent to the left child and
((awaken, i)) is sent to the right.

Lookup and New awakening: Internal nodes relay parents' messages to their children.

If leaf node i receives ((record, focus, i, j)) it sets M(focus) $\leftarrow$ j. If it receives ((awaken, i)) it
awakens. (If i doesn't match it does nothing.)

Refocus: Each focus sends its children a "become a focus" message and dies. A live node
receiving such a message from its parent changes its status to "focus". A leaf receiving such a
message from its parent sends the latter an "I died" message.

Repeat (maybe): At all times, a node receiving two "I died" messages sends one upward. If a
node receives a "become a focus" message it sends its children a "begin survey" message. Live
intermediate nodes relay such a message, and leaf nodes receiving a "begin survey" message
proceed as in Survey.


3.2.3.4  Results  of  Storing  the  Map  in  Internal  Nodes 

The tree-structured algorithm of 2.2.2.3 uses $O(\log^2 n)$ time per row of the adjacency matrix.
More importantly, this constant factor includes a communication between adjacent nodes. It
is impossible to do better assuming that the information required to reconstruct connected
components is to be kept in the leaves and that there is only to be a logarithmic amount of
information in each leaf. The reason for this is that the action taken by the right subtree of a
given node depends on information present only in the left subtree, and that the right subtree's
recursive analysis of its pattern of leaves to be linked can, in turn, depend on the results of this
feedback. We therefore have a logarithmic number of steps, each of which takes O(log n) time.

It is possible to reduce the constant factor, but only by distributing the information differently.

Instead of having a cell in each leaf for each of its ancestors, suppose we have a cell in each
ancestor for each of its leaves. The same number of cells are required, one for each leaf/ancestor
pair.* Each internal node contains a map which maps names of leaves of the left subtree into
either nil or names of leaves of the right subtree.

The overall view of the algorithm is as follows:

Each leaf sends its parent its name if it is active, or nil. Each intermediate or root node sends
its parent either the name of any active node it receives from its children, or nil if it receives nil
from both children. If it receives two names it chooses arbitrarily. Each intermediate node also
remembers what it received from its children.

In addition, suppose it receives a name from both children. There are two cases. If the name
from the left node maps (in the node's internal mapping from leaves to values) into nil, make
it map into the name from the right node and do nothing else. If it maps into (say) t, send
awaken t to the right child and do nothing else.
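
The rule an internal node applies to the two names it receives can be sketched as a small function. This is a minimal Python sketch under stated assumptions: the node's map is modelled as a dictionary, nil as `None`, and the function name `combine` and its return convention are hypothetical, introduced only for illustration.

```python
def combine(node_map, left_name, right_name):
    """One internal node's reaction to the names sent up by its children.

    node_map maps names of leaves of the left subtree to names of leaves of
    the right subtree; a missing key plays the role of nil.
    Returns (name_for_parent, awaken_target), where awaken_target is the
    leaf that the right child should be told to awaken, or None.
    """
    if left_name is None and right_name is None:
        return None, None                      # nothing active below this node
    if left_name is None or right_name is None:
        # Exactly one child reported a name: relay it and do nothing else.
        return (left_name if left_name is not None else right_name), None
    partner = node_map.get(left_name)
    if partner is None:
        node_map[left_name] = right_name       # extend the map, nothing else
        awaken = None
    else:
        awaken = partner                       # "awaken t" goes to the right child
    # The choice of which name to pass upward is arbitrary; the left one is used here.
    return left_name, awaken
```

The first time leaves 2 and 5 meet at a node, the map gains the entry 2 to 5. If leaf 2 later meets leaf 7 at the same node, the map is left alone and leaf 5 is awakened instead.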

If an intermediate node receives an awaken t message from its parent, it checks to see whether t
is in its right or left subtree. It also checks to see what it has received before.

If a node receives an awaken t message and has already received a name from t's subtree it sends
an awaken t message to the appropriate child. If it hasn't so received it considers itself to have so
received. (This can involve reacting to further awaken messages, or it can involve looking up
either t (if t belongs in the left subtree) or the previously received name (if t was in the right
subtree and the previous name was in the left) in the mapping and either extending the mapping
or creating a new awaken message.)

The root sends its children "ok" when it's done. Intermediate nodes relay such "ok" messages.
Each leaf reads the next line of the adjacency matrix when it receives this ok, and starts a new
cycle.

The "wrapup", where each leaf gets the name of a representative of its connected component,
is also faster under this arrangement. The root sends its right child its correspondences one by
one, followed by "end". When a node receives $a \rightarrow b$ it replaces $b \rightarrow c$ (if it has one) by
$a \rightarrow c$. This is not done for $b \rightarrow$ nil. Intermediate nodes also relay correspondences received
from parents. When an intermediate node receives "end" from its parent, it dumps its own
correspondences as they now stand and then sends its own "end". A leaf node initializes a cell
to its own name, and a leaf named b changes this value to a if it receives $a \rightarrow b$. A leaf node
knows it has the right value when it sees "end".


* And it should lay out reasonably nicely because the bigger nodes are closer to the root of the tree.
Chapter 3


Use of Additional Techniques - Binary Addition


§3.1  Notation

In what follows, we will assume that a problem instance resides in vectors A and B, each
containing individual "bits" $a_i$ resp. $b_i$ for $0 \le i \le n-1$. The two states of a bit are
represented by the values 0 and 1. This discussion is specialized to binary integers, but any radix
can be used by reinterpreting the logical operators as follows: $\oplus = +$ (mod (the radix)),
$\wedge = \lambda x,y\,[x + y \ge (\text{the radix})]$, $\sim\ = \lambda x\,[(\text{the radix}) - x]$, and
$\vee = \lambda x,y\,[x + y \ge (\text{the radix} - 1)]$. We apply logical operators to the values 0 and 1,
interpreting 0 as false and 1 as true. A represents $\sum_{0 \le i \le n-1} a_i 2^i$, and B likewise.
The answer is similarly represented in C. We will have occasion to refer to $\mathit{carry}_i$, the carry
coming into position i. We use $\oplus$ as the symbol for "exclusive OR".

Our starting point for all of the syntheses in this paper will be the specification:

; we want to add A + B where $A = a_{n-1} \ldots a_1 a_0$ and B similarly.

$$\forall\, 0 \le i \le n-1:\quad c_i = a_i \oplus b_i \oplus \exists_{j<i}\,\bigl[a_j \wedge b_j \wedge \forall_{j<k<i}\,[a_k \vee b_k]\bigr]$$

Figure 2. Our "Standard" Specification of Binary Addition

A derivation of this specification from the "grade school" specification for addition

$$\mathit{carry}_0 = 0$$
$$\forall\, 0 \le i \le n-1:\quad c_i = a_i \oplus b_i \oplus \mathit{carry}_i$$
$$\mathit{carry}_{i+1} = (\mathit{carry}_i \wedge (a_i \vee b_i)) \vee (a_i \wedge b_i)$$

Figure 3. "Grade School" Specification for Binary Addition

is beyond the scope of this paper, although a derivation of the latter from the former will be
briefly sketched.
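
Although the derivation is not given here, the agreement of the two specifications can be checked exhaustively for small n. The following is a minimal Python sketch, assuming bit vectors are sequences of 0/1 values with index 0 as the least significant bit; the function names are illustrative and not part of the original text.

```python
from itertools import product


def add_standard(a, b):
    """'Standard' specification: the carry into position i exists iff some
    j < i has a[j] and b[j] both set, and every k with j < k < i has
    a[k] or b[k] set."""
    n = len(a)
    c = []
    for i in range(n):
        carry = any(
            a[j] and b[j] and all(a[k] or b[k] for k in range(j + 1, i))
            for j in range(i)
        )
        c.append(a[i] ^ b[i] ^ int(carry))
    return c


def add_grade_school(a, b):
    """'Grade school' specification: carry_0 = 0 and the usual recurrence."""
    carry = 0
    c = []
    for i in range(len(a)):
        c.append(a[i] ^ b[i] ^ carry)
        carry = (carry & (a[i] | b[i])) | (a[i] & b[i])
    return c


def specs_agree(n):
    """Compare the two specifications on every pair of n-bit inputs."""
    bits = list(product((0, 1), repeat=n))
    return all(add_standard(a, b) == add_grade_school(a, b)
               for a in bits for b in bits)
```

Calling `specs_agree(4)` compares the two forms on all 256 pairs of four-bit inputs. This is a check on small cases only, not a substitute for the derivation.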

Frequent reference is made to a system called TransConS. This is the TRANSformational
CONcurrency Synthesiser (described elsewhere [King-83] [KingMayr-84]) which we are developing
at Kestrel. TransConS is an architecture synthesis system which can be used to transform
high level specifications into parallel structures. Its features of interest here include the ability to
synthesise tree structured processor networks from specifications, and the ability to reformulate
concurrent computations by reorganising the work differently among a collection of processors.


§3.2  Carry  Look-ahead  Circuit 

Consider the definition of Figure 1. The problem with directly synthesising solutions to this by
the methods of TransConS resides in the nesting of quantifiers such that the bound variable
of the outer quantifier is one end of the range of the inner one. The reason this is a problem
is that it forces the computation of $\Theta(n^2)$ boolean values, namely $\forall_{j<k<i}\,[a_k \vee b_k]$ for each
$0 \le j < i \le n-1$ (a total of $n(n-1)/2$ $(i,j)$ pairs).


§§3.2.1  Quantifier  Levelling 

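When the following equivalences are applied successively (brief proofs appear in the Appendix)

$$\forall_{z \le x < u}\,[P(x)] \;\equiv\; \max_{x<u}[\sim P(x)] < z \qquad (\forall\text{-to-max})$$

$$\exists_{x<u}\,[P(x) \wedge F(u) \le x] \;\equiv\; \exists_{F(u) \le x < u}\,[P(x)] \qquad (\text{constraint-to-binder})$$

$$\exists_{l \le x < u}\,[P(x)] \;\equiv\; \max_{x<u}[P(x)] \ge l \qquad (\exists\text{-to-max})$$

(here $\max_{x<u}[Q(x)]$ denotes the largest $x < u$ for which $Q(x)$ holds)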


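we get the following sequence of assignments to $c_i$:

$$c_i = a_i \oplus b_i \oplus \exists_{j<i}\,\bigl[a_j \wedge b_j \wedge \forall_{j<k<i}\,[a_k \vee b_k]\bigr]$$

$$\Rightarrow\ c_i = a_i \oplus b_i \oplus \exists_{j<i}\,\bigl[a_j \wedge b_j \wedge \max_{k<i}[\sim(a_k \vee b_k)] \le j\bigr]$$

$$\Rightarrow\ c_i = a_i \oplus b_i \oplus \exists_{\max_{k<i}[\sim(a_k \vee b_k)] \,\le\, j \,<\, i}\,[a_j \wedge b_j]$$

$$\Rightarrow\ c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \wedge b_j] > \max_{k<i}[\sim(a_k \vee b_k)]\bigr)$$

(The comparison in the last line can be taken as strict because $a_j \wedge b_j$ and
$\sim(a_j \vee b_j)$ cannot both hold at the same index, so the two maxima never coincide at a
finite value.)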

It should be observed that the first transformation solved the basic problem of the need to
compute $\Theta(n^2)$ values, and that after the last transformation it is possible to do both
enumerations in parallel. The bound variable of one enumeration is no longer an endpoint of the
range of the other. This means that where we would previously have had approximately $n^2/2$
data items to consider, we now have approximately 2n.

We now have

$$\forall\, 0 \le i \le n-1:\quad c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \wedge b_j] > \max_{k<i}[\sim(a_k \vee b_k)]\bigr)$$


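It is possible to express this as an inequality between corresponding elements of the results of
two parallel prefix computations as follows:

$$
\begin{aligned}
&\forall\, 0 \le i \le n-1:\\
&\quad \mathit{and}_i = a_i \wedge b_i\\
&\quad \mathit{nor}_i = \sim(a_i \vee b_i)\\
&\quad \mathit{max1and}_i = \text{if } \mathit{and}_i \text{ then } i \text{ else } -\infty\\
&\quad \mathit{max1nor}_i = \text{if } \mathit{nor}_i \text{ then } i \text{ else } -\infty\\
&\quad \mathit{maxand}_i = \max_{0 \le j < i}[\mathit{max1and}_j] \qquad (*)\\
&\quad \mathit{maxnor}_i = \max_{0 \le j < i}[\mathit{max1nor}_j] \qquad (*)\\
&\quad c_i = a_i \oplus b_i \oplus (\mathit{maxand}_i > \mathit{maxnor}_i)
\end{aligned}
$$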
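
The max-based formulation can be exercised directly. The following is a minimal Python sketch, assuming bit vectors are lists of 0/1 values with index 0 as the least significant bit; it uses a sequential running maximum as a stand-in for the two parallel prefix computations, so it illustrates the arithmetic only and not the tree structure. The function name `add_max_form` is an assumption made for illustration.

```python
NEG_INF = float("-inf")


def add_max_form(a, b):
    """Add two little-endian bit vectors using the maxand/maxnor comparison."""
    n = len(a)
    # Per-position values: the index itself if the condition holds, else -infinity.
    max1and = [i if (a[i] and b[i]) else NEG_INF for i in range(n)]
    max1nor = [i if not (a[i] or b[i]) else NEG_INF for i in range(n)]
    c = []
    maxand = maxnor = NEG_INF        # maxima over the empty range j < 0
    for i in range(n):
        # The carry into position i is set when the latest "and" position
        # below i lies above the latest "nor" position below i.
        c.append(a[i] ^ b[i] ^ int(maxand > maxnor))
        maxand = max(maxand, max1and[i])
        maxnor = max(maxnor, max1nor[i])
    return c
```

The two running maxima are independent of each other, which is the property that lets the two enumerations proceed in parallel.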


TransConS will be able to synthesise the usual parallel prefix tree structure [Browning-80]
for each of the two lines marked by an asterisk above. Most of the details of this synthesis are
beyond the scope of this paper, but the tree structure comes from uses of divide-and-conquer.
The intermediate steps, taken from [KingMayr-84], are shown in the Appendix.

There  are  two  parallel  prefix  trees  in  the  addition  parallel  structure;  one  for  the  variable  named 
maxand  and  another  for  maxnor.  The  overall  structure  is  shown  below. 
There are two important differences between this structure and standard ones [Hwang-70].

* Because the parallel prefix trees are required to handle integers in the interval [0, n], the size
of the nodes and the width of the data paths within the trees are $\Theta(\lg n)$. In the standard
network it would be O(1). This can be alleviated by some careful reasoning, to be described
below.

* Because of the nature of the parallel prefix network synthesized by TransConS, each node is
partially responsible for the choreography in its local region. The importance of this fact is that
either the nodes need be big enough to participate in an asynchronous data transfer protocol
with a handshake, or a global clock must be provided. This is not a serious problem because
other parallel prefix networks could have been used (and incorporated into TransConS),
and because a three- or five-inverter clock [ConMead-80] can easily be included on the chip if
necessary.


§§3.2.2  Data Path Width Reduction

To  reduce  the  width  of  the  data  paths  and  still  use  a  parallel  prefix  network,  an  associative 
operation  with  constant  range  and  domain  must  be  used. 

Now either $\max_{0 \le j \le i}[\mathit{max1and}_j] = \max_{0 \le j \le i+1}[\mathit{max1and}_j]$ or
$\max_{0 \le j \le i+1}[\mathit{max1and}_j] = i+1$, and similarly for max1nor. A case analysis could show
that we would have the following table:


                                         maxand_i > maxnor_i
    which of and_{i+1}, nor_{i+1} hold   true        false
    ----------------------------------   -----       -----
    and                                  true        true
    nor                                  false       false
    both*                                true        true
    neither                              true        false

(* this is impossible but knowledge of this fact is unnecessary for the argument)

The effect of $\mathit{and}_{i+1}$, $\mathit{nor}_{i+1}$, $\mathit{and}_{i+2}$ and $\mathit{nor}_{i+2}$ on the truth of
$\mathit{maxand}_{i+2} > \mathit{maxnor}_{i+2}$ given $\mathit{maxand}_i > \mathit{maxnor}_i$ can also be summarized
below. (Here the impossible combinations have been omitted for brevity.)


                                         maxand_i > maxnor_i
    which of and_{i+1}, nor_{i+1},
    and_{i+2}, nor_{i+2} hold            true        false
    ----------------------------------   -----       -----
    none                                 true        false
    and_{i+1}                            true        true
    nor_{i+1}                            false       false
    and_{i+2}                            true        true
    and_{i+2}, and_{i+1}                 true        true
    and_{i+2}, nor_{i+1}                 true        true
    nor_{i+2}                            false       false
    nor_{i+2}, and_{i+1}                 false       false
    nor_{i+2}, nor_{i+1}                 false       false

We see that each string of input bit pairs is an operator that can do one of three things: it can
act like a single pair of bits both of which are true (called $\langle\mathit{and}\rangle$ below), like a single
pair of bits both of which are false (called $\langle\mathit{nor}\rangle$), or it can act like a pair one of which
is true (called $\langle\mathit{other}\rangle$). (Use of this form of reasoning is justified by the properties of
max: the value of a max expression depends on a single extreme element.) The binary operator
$\odot = \lambda x,y\,[\text{if } y = \langle\mathit{and}\rangle \text{ then } \langle\mathit{and}\rangle \text{ elseif } y = \langle\mathit{nor}\rangle \text{ then } \langle\mathit{nor}\rangle \text{ else } x]$
is associative, and if the identity of this operator is considered to be $\langle\mathit{other}\rangle$ then
$\mathit{maxand}_i > \mathit{maxnor}_i$ holds exactly when the $\odot$-composition of the operators for
bit positions $0 \le j < i$ equals $\langle\mathit{and}\rangle$. This is precisely what was needed: an operator,
amenable to parallel prefix computation, with finite range and domain.
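
The three-valued operator can be tried out on small inputs. The following is a minimal Python sketch, assuming little-endian lists of 0/1 bits; the names `classify`, `compose` and `carries` are illustrative assumptions, and `itertools.accumulate` (with its `initial` argument, Python 3.8 or later) stands in sequentially for the parallel prefix tree.

```python
from itertools import accumulate

AND, NOR, OTHER = "and", "nor", "other"


def classify(ai, bi):
    """Map one pair of input bits to its three-valued operator."""
    if ai and bi:
        return AND
    if not (ai or bi):
        return NOR
    return OTHER


def compose(x, y):
    """x summarises the lower positions, y the higher ones.
    y decides the outcome unless it is OTHER, the identity."""
    return y if y != OTHER else x


def carries(a, b):
    """Carry into each position: the composition of all lower positions,
    starting from the identity OTHER, compared with AND."""
    ops = [classify(ai, bi) for ai, bi in zip(a, b)]
    prefixes = list(accumulate(ops, compose, initial=OTHER))
    return [int(p == AND) for p in prefixes[:len(ops)]]


def add_with_operator(a, b):
    return [ai ^ bi ^ ci for ai, bi, ci in zip(a, b, carries(a, b))]
```

Because `compose` is associative and ranges over only three values, any parallel prefix network could replace the sequential `accumulate` call while keeping constant-width data paths.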

Use of a specification based on this operator will yield a network similar to Figure 3, except that
there will only be a single parallel prefix tree, each bit's carry will be used directly rather than
computed from the two parallel prefix trees, and (of course) the widths of the data paths and
the sizes of the nodes will be smaller.


§3.3  Ripple-carry and Bit Serial Circuits

Consider our "standard specification" of Figure 1. If we apply the quantifier levelling of
Subsection 4.1, we get:

$$\forall\, 0 \le i \le n-1:\quad c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \wedge b_j] > \max_{k<i}[\sim(a_k \vee b_k)]\bigr)$$

We repeat the reasoning for representing $\max_{0 \le j \le i+1}[P(j)]$ in terms of $\max_{0 \le j \le i}[P(j)]$
and $P(i+1)$ (see Subsection 4.2). We also apply the next argument of that Subsection, giving
a recurrence for the max ... > max ... expression. That expression has a single free variable,
i, and we will call its value $\mathit{carry}_i$.


Consider the recurrence $\mathit{carry}_0 = \text{false}$ and
$\mathit{carry}_{i+1} = (\mathit{carry}_i \wedge (a_i \vee b_i)) \vee (a_i \wedge b_i)$. (More precisely,
$\mathit{carry}_{i+1} = \text{if } a_i \wedge b_i \text{ then true elseif } \sim(a_i \vee b_i) \text{ then false else } \mathit{carry}_i$.)
This leads immediately to the grade school specification of Figure 2.

Using the techniques of TransConS (assigning a processor to each value, developing an
interconnection graph, and specifying the appropriate work for each processor), we immediately
get the ripple-carry unit shown below.



Figure  5.  Ripple  Carry  Parallel  Structure 

A technique called aggregation [King-83] is applicable. This technique replaces a related series
of processing elements by a single element that receives a series of related data. The circuit of
Figure 4 is an indexed series of identical modules, and identifying corresponding nodes of the
series of modules gives the bit serial addition circuit shown below.


Figure  6.  Serial  Adder 
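
The effect of aggregation can be pictured in software as one stateful element that is fed one bit pair per step. The following is a minimal Python sketch, assuming bits arrive least significant first; the class name `SerialAdder` and its interface are hypothetical and only model the behaviour, not the circuit.

```python
class SerialAdder:
    """A single adder cell with a stored carry, fed one bit pair per step."""

    def __init__(self):
        self.carry = 0                      # carry_0 = 0

    def step(self, a_bit, b_bit):
        """Consume one pair of input bits and emit one sum bit."""
        s = a_bit ^ b_bit ^ self.carry
        # Same recurrence as the ripple-carry chain, with the stored carry
        # taking the place of the wire between neighbouring modules.
        self.carry = (self.carry & (a_bit | b_bit)) | (a_bit & b_bit)
        return s


def add_serial(a, b):
    """Feed two little-endian bit sequences through one SerialAdder."""
    adder = SerialAdder()
    return [adder.step(x, y) for x, y in zip(a, b)]
```

The stored carry plays the role of the indexed chain of modules in the ripple-carry structure: the same module is reused once per bit position.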


"I 
